"Aligned" shouldn't be a synonym for "good"

By Ajeya Cotra — Mar 26, 2023

Perfect alignment just means that AI systems won’t want to deliberately disregard their designers' intent; it's not enough to ensure AI is good for the world.


Progress on technical research benefits from a clear technical problem statement. Here is an example of a good technical problem statement: “How can we design a nuclear weapon such that if an explosive is set off in contact with the bomb, there’s a less than 1 in a million chance that the bomb produces a nuclear explosive yield of greater than 4 pounds of TNT-equivalent?”

Here is an example of a terrible technical problem statement: “How can we design a nuclear weapon such that it’s good for the world, rather than bad?” Nuclear physicists can’t run a bunch of experiments, and prove a bunch of theorems, and thereby produce a “good” nuke (whatever that means). Moreover, they don’t (and shouldn’t!) have the legitimate authority to make a call about what kind of nuke is a “good” nuke in the first place.

Similarly, I sometimes see people using “the alignment problem” to mean “the problem of how to build an AI such that, when you run it, it’s good for the world and not bad.”^[1] I think this is also misguided (though it’s less clear-cut). AI researchers most likely can’t do experiments and prove theorems and write code and thereby guarantee that we only produce “good” AIs — and even if they could, I don’t think it’s their place to make that decision on society’s behalf.^[2]

Unfortunately, alignment as a field is in its infancy, and we’re pretty far from having well-posed technical problems to work on, but I think we can still productively narrow our scope.

The cloud of research people are currently working on under the umbrella of “alignment” will probably decompose into several reasonable technical problems. The subset of research problems I’m most interested in is:

Can we find ways of developing powerful AI systems such that (to the extent that they’re “trying” to do anything or “want” anything at all), they’re always “trying their best” to do what their designers want them to do, and “really want” to be helpful to their designers?

That’s what I mean when I talk about alignment. (Even if you disagree with this usage, it’s useful to clarify that I and many others^[3] use the term “alignment” in this narrower way.)

Perfect alignment techniques (as I use the term) would allow a company like Google to train very smart models^[4] so that if Google asks them to increase ad revenue or optimize TPU load balancing or sell all their customers’ private info or censor YouTube videos criticizing Xi Jinping, then they’d be fully motivated to do all those things and whatever else Google asks them to do (good or bad!).^[5] With perfect alignment techniques, Google would be able to instill in its AIs a pure desire to be as helpful as possible to Google.^[6]

I care a lot about alignment research because I think that without better alignment, we could have a full-blown AI takeover on our hands (like in The Terminator). Without access to good alignment techniques, Google could end up training AI systems that are motivated by “maximizing long-run reward” (which would incentivize them to seize control from humans so they can set their own rewards) or by their own alien long-term goals (which probably also incentivize the whole seizing control thing). Whether or not you like big tech companies, research that helps them avoid accidentally unleashing a digital alien invasion^[7] seems good. (Though advocacy and regulation to stop companies from training models that could take over the world also seem good; we don’t have to choose one or the other.)

This narrower definition is still a pretty squishy problem statement, but I think it’s a better start than “make AI good for the world.” There’s some hope that this is the kind of problem that can and should be solved by proving theorems and running experiments and writing code.

And by that same token, developing perfect alignment techniques is obviously not enough to ensure that AI is good for the world. Perfect alignment techniques just mean that AI systems won’t want to deliberately disregard the desires of whatever humans designed and trained them. Potentially important issues this would not necessarily solve:

- Misuse: We could have perfect alignment techniques, and Kim Jong-un could use those techniques to train AIs that are aligned to him and faithfully help him surveil and crush dissidents, indefinitely extend his natural lifespan, develop superweapons to invade South Korea…
- Inequality: We could have perfect alignment techniques, and companies could train incredibly profitable AIs with those techniques that put most ordinary people out of work and extraordinarily enrich and empower a small group of capital-holders.
- High-stakes errors:^[8] We could have perfect alignment techniques, and an AI aligned to the US military may, because of faulty sensor readings, honestly come to believe that China launched a first-strike nuclear attack, and decide to fire its own nukes in retaliation.
  - …including errors about what a human wanted: Or it could make a well-calibrated estimate that there was a 35% chance that the sensors could be faulty, but mistakenly guess that the President wouldn’t want it to wait any longer to double-check.
- Chain-of-command issues: We could have perfect alignment techniques, but e.g. the President (or someone else entrusted with “legitimate authority” somehow) may not know how to apply those techniques directly, so they may have to hope that their underlings with technical know-how will actually train the AI as instructed.
- Security: We could have perfect alignment techniques, but if hackers can gain access to an AI training process, they could derail training so the AI does not end up properly aligned to the humans who nominally control it. (Or they could simply steal the AI system and use it to do bad things, which would be a misuse issue.)
- Global stability: We could have perfect alignment techniques, and conflict over which nations should get to develop powerful aligned AI could still spark World War 3. Or aligned AIs could invent new military technology that destroys the logic of nuclear deterrence.
- Preference aggregation: We could have perfect alignment techniques, and there would still be questions about what procedures we should use to train or instruct AI systems whose decisions might have binding impacts on large numbers of people with mutually conflicting preferences. (Sometimes this is referred to as “many-to-many AI alignment,” but I think the way we deal with this probably looks less like “AI research” and more like “law and policy and norm-setting.”)
- Philosophy and meta-ethics: We could have perfect alignment techniques, but even if AIs genuinely want to help us, they may not be able to decide for us what it is we really value. We may still be faced with tough questions about what the good life is, and what a good universe looks like.

You might read all this and think “Wow, there’s a lot to do and this alignment thing people harp on is just a small piece of the story.” And…yeah, I basically think we’re hurtling toward an insane world we’re not ready for, we have a huge amount to do that we mostly haven’t gotten started on, and we’ll probably neglect or mess up almost all of it.

I think if we develop perfect alignment techniques, the future will be ~20% better or something like that. That’s shockingly high-stakes for a problem that could maybe possibly be solved by doing science to it.^[9] I’m obsessed with alignment because I think it’s the best way that I can make the world go better with my career. But different issues need different intellectual communities tackling them, and I don’t think it’s productive to cram the entire project of building a flourishing and just future under the banner of “AI alignment research.”

[1] E.g. here: “The ‘alignment problem for advanced agents’ or ‘AI alignment’ is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.” Or here: “The overall problem of alignment is the problem of, for an Artificial General Intelligence with potentially superhuman capabilities, making sure that the AGI does not use these capabilities to do things that humanity would not want.”
[2] That’s what, well, all of the institutions in all of society are for (companies making decisions under constraints imposed by governments and influences created by shareholders, activists, pundits…). I expect a number of readers would think this is a horrible system, and depending on what they mean I might agree. But I think we shouldn’t replace it with “let whichever ML researchers happened to choose to work in the ‘AI alignment’ field decide.”
[3] A number of Open Phil grantees and collaborators use the term in the same way, but that’s far from universal -- there are very few universally-agreed standard definitions and concepts in this area right now.
[4] Usually, I’m thinking about advanced AI systems trained in broadly the same way that existing powerful and general ML models like GPT-Ns and GATOs and Codexes are trained, although sometimes I support alignment research that’s not premised on deep-learning-based AI systems. I often use “model” and “AI system” interchangeably.
[5] In a normal way that reflects the AI systems’ full understanding of how normal humans use words, without any Literal Genie shenanigans like blowing up all the TPUs so that the load across TPUs is technically “perfectly balanced.” (Note that “being way too literal” is not a failure mode I’m particularly worried about; I think AI systems will interpret normal English instructions in sane ways by default, without particular effort toward alignment techniques.)
[6] This might sound kind of creepy and house elf like. And it might be. I have really conflicted feelings about this personally, but after reflecting on it, I still overall support most particular types of alignment research we could do today.
[7] Most likely, ideal alignment techniques would do more than just prevent AI takeover; they’d also (e.g.) prevent models from simply “not trying their best because they don’t care that much” or “insisting on speaking in iambic pentameter even if that’s slightly more confusing.” That’s not the main reason I think it’s important to support the research, but I think “get AIs to want to be helpful” is probably a better technical problem statement than “get AIs not to take over the world.” For one thing, the latter could be solved by only building weak AIs that wouldn’t be capable of taking over the world, which feels unsustainable to me.
[8] There is different technical AI research that we could do to reduce the risk of high-stakes errors (just not AI alignment research as I use the term).
[9] And I happen to think it’s a more important problem quantitatively speaking than a number of other problems on the list above, but I understand and expect disagreement there.