Alignment researchers disagree a lot

Many fellow alignment researchers may be operating under radically different assumptions from you.


The way I use the term,[1] “AI alignment research” is technical research that’s trying to develop ways of building powerful AI systems so that they “try their best” to do what their designers want them to do, and never try to deliberately circumvent or disregard their designers’ intent. If an AI system is trying to do what its designer intends (as best it can), we say it’s “aligned;” otherwise, it’s “misaligned.”

This is unfortunately an inherently fuzzy and slippery definition for a technical research field. For one thing, it’s debatable to what extent it makes sense to talk about today’s AI systems “trying” to do anything “deliberately” — there are no clear-cut observations that could tell us that an AI’s behavior was “intentional,” and we don’t currently have the tools to straightforwardly “look inside its head” to learn what it was thinking. For another, it’s not totally obvious how much of a problem there is or how serious it is[2] — if we train increasingly intelligent AI systems in straightforward ways using existing techniques, they may simply turn out to be aligned by default (and if they’re technically “misaligned,” the consequences may not be dire).

This makes AI alignment a tricky and frustrating area to work in and engage with. It’s a pre-paradigmatic field — there’s very little shared foundation that researchers can lay bricks on top of right now. Self-described AI alignment researchers have vastly different ideas about how to define the alignment problem, how likely misalignment is in the first place, what the consequences of powerful misaligned AI are likely to be and how bad they are, and what kind of research is vitally important versus a distracting waste of time. Some self-described alignment researchers would say that other self-described alignment researchers are actively destroying the world. Some feel they’ve already made massive progress on problems that others consider core difficulties which haven’t been seriously tackled. Some create benchmarks to measure alignment-related phenomena or demonstrate an alignment problem, while others think these benchmarks and demos all miss the heart of the problem.

Like anyone else who thinks about this area, I’m coming from my own particular perspective, and not everyone who’s working on (what they would call) “AI alignment” will agree with my view. For example, I care about AI alignment research primarily because I think that there’s a good chance that advanced AI systems will take over the world in a possibly-violent uprising or coup unless we make more research progress; many but not all people working on the relevant technical research problems share this picture.[3] As another example, I think that it’s likely that we could soon train systems capable enough to obsolete humans by straightforwardly applying existing deep learning techniques, and I believe there’s a lot of productive alignment research to be done from within the deep learning paradigm (again, many but not all people in the broader alignment field agree with this).

I think that the field as a whole could be much more effective if different camps could arrive at a workable set of common assumptions that could be built upon. Failing that, it would still be progress if each of the various “camps” could more clearly articulate its implicit assumptions, so the disagreements between them can become starker and more easily understood by newcomers to the field.

I’m cautiously optimistic that the field can move in this direction (and I hope this blog can help somewhat), but I think it’ll be a very long and difficult process and I’m not sure how well it’ll work. In the meantime, if you’re a new alignment researcher, it’s worth keeping in mind that many fellow researchers in this field may be operating under radically different assumptions from you — to the point where your research might be unintelligible to them and vice versa.

  1. A number of Open Phil grantees and collaborators use the term in the same way, but that’s far from universal — there are very few universally-agreed standard definitions and concepts in this area right now. ↩︎

  2. If we blindly train AI in the easiest possible way without paying any special attention to alignment issues, I still think there’s only a ~75% chance we’ll run into a problem. And if we make a reasonable, careful effort to avoid it, I’m not sure how much the problem goes away. ↩︎

  3. Depending on how much research we consider “relevant,” the vast majority of relevant researchers may think that my threat model is far-fetched. ↩︎