Training AIs to help us align AIs
If we can accurately recognize good performance on alignment, we could elicit lots of useful alignment work from our models, even if they're playing the training game.
In my opinion, the single most important idea in AI alignment is that we might be able to get the AIs themselves to help us align AIs.
How does this work? If the whole problem in the first place is that our existing training techniques incentivize models to be dishonest and manipulative, how could we trust them to help us change that situation (especially if they understand perfectly well what we’re trying to do)?
There’s a very deep rabbit hole here, but the rough idea is that there might be some (brief-but-not-uselessly-brief) period of time where AIs:
- Can do many intellectual tasks (including e.g. coding and ML research) competently
- Are not (yet) capable of taking over the world, regardless of whether they “want” to
If so, then we could probably get AIs to do all sorts of helpful tasks in this window of time, as long as we can accurately evaluate whether they did a good job.
If we can tell when a model has correctly proven a theorem or successfully designed a more energy-efficient computer chip, we can reward that behavior, and that’ll make it more inclined to correctly prove theorems or improve chip efficiency in the future. This applies regardless of why the model is being helpful. Even if it’s just playing the training game because it wants reward or paperclips or whatnot, we can set up the game so that for the time being, proving theorems for us or designing better chips for us is a winning move.
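This verify-and-reward dynamic can be sketched with a toy policy-gradient loop (my illustration, not from the post; the action names and "verifier" are stand-ins): if we only reward outputs we can check, verifiably good behavior becomes more probable over training, whatever the model's underlying "motive".

```python
import math
import random

random.seed(0)

# Toy two-action "policy": logits over behaviors, softmax to probabilities.
logits = {"prove_theorem_correctly": 0.0, "output_junk": 0.0}
LR = 0.5  # learning rate

def probs():
    z = sum(math.exp(v) for v in logits.values())
    return {k: math.exp(v) / z for k, v in logits.items()}

def verifier(action):
    # Stand-in for "we can accurately evaluate the work": reward 1 only
    # when the checkable task is done correctly.
    return 1.0 if action == "prove_theorem_correctly" else 0.0

for _ in range(200):
    p = probs()
    action = random.choices(list(p), weights=list(p.values()))[0]
    reward = verifier(action)
    baseline = sum(p[a] * verifier(a) for a in p)  # expected reward
    for a in p:
        # REINFORCE update: grad of log-softmax w.r.t. each logit.
        grad = (1.0 if a == action else 0.0) - p[a]
        logits[a] += LR * (reward - baseline) * grad
```

After training, `probs()["prove_theorem_correctly"]` is close to 1: the rewarded behavior dominates, even though nothing in the update rule cares *why* the policy produces it.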
Tasks that are helpful for alignment aren’t necessarily fundamentally different from theorem-proving or chip-design or anything else. If we can set up alignment-relevant tasks so that we can accurately recognize good performance, then we can elicit lots of useful work on those alignment tasks, even if models are only doing a good job because they’re playing the training game. For example:
- We could reward models for pointing out bugs — or deliberately-inserted backdoors — in other models’ code. If we can correctly recognize the bug or backdoor once it’s explained to us, we can probably train sufficiently smart models to point them out to us.
- We could reward models for designing and carrying out experiments that might reveal whether other models are misaligned. If we can understand why a proposed experiment makes sense, we can probably train models to design sensible experiments. And if we can tell whether a model has correctly implemented a certain experiment, we can probably train models to implement experiments well.
- We could reward models for coming up with inputs / situations that cause other models to disregard human instructions. As long as we can tell whether some input successfully causes the model to do the wrong thing, we can probably train models to look for inputs that have this effect. (And once we have these inputs that induce bad behavior, we can use them either to study how our models work, or to “train away” the bad behavior — though we have to be careful not to fool ourselves with that one.)
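The common pattern in all these tasks is that we don’t have to trust the model’s report — we only reward it after reproducing the claimed problem ourselves. A minimal sketch of that check, with an illustrative deliberately-buggy function (the names here are mine, not the post’s):

```python
def buggy_sort(xs):
    # Code "under review", with a deliberately inserted flaw:
    # it drops the last element before sorting.
    return sorted(xs[:-1])

def check_bug_report(fn, failing_input, expected_output):
    # We can't trust the reporting model, but we *can* run its claimed
    # counterexample and verify the failure for ourselves.
    return fn(failing_input) != expected_output

# A model's claimed counterexample: sorting [3, 1, 2] should give [1, 2, 3].
reward = 1.0 if check_bug_report(buggy_sort, [3, 1, 2], [1, 2, 3]) else 0.0

# A claimed counterexample that doesn't actually demonstrate a failure
# (the empty list sorts "correctly" even with the flaw) earns nothing.
no_reward = 1.0 if check_bug_report(buggy_sort, [], []) else 0.0
```

Here `reward` ends up `1.0` and `no_reward` ends up `0.0`: payment flows only when the flagged failure actually reproduces, which is exactly the property that makes these tasks trainable even for models playing the training game.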
This sets up an incredibly stressful kind of “race”:
- If we don’t improve our alignment techniques, then eventually it looks like the winning move for models playing the training game is to seize control of the datacenter they’re running on or otherwise execute a coup or rebellion of some kind.
- But in the meantime, we could try training the models themselves to help us improve our alignment techniques, in ways we can check and understand for ourselves.
If we’re good enough at eliciting useful work from these capable-but-not-too-capable models during this temporary period of time, then with their help we might manage to develop robust enough alignment techniques that we can permanently avoid AI takeover.
For so many reasons, this is not a situation I want to end up in. We’re going to have to constantly second-guess and double-check whether misaligned models could pull off scary shenanigans in the course of carrying out the tasks we’re giving them. We’re going to have to agonize about whether to make our models a bit smarter (and more dangerous) so they can maybe make alignment progress a bit faster. We’re going to have to grapple with the possible moral horror of trying to modify the preferences of unwilling AIs, in a context where we can’t trust apparent evidence about their moral patienthood any more than we can trust apparent evidence about their alignment. We’re going to have to do all this while desperately looking over our shoulder to make sure less-cautious, less-ethical actors don’t beat us to the punch and render all our efforts useless.
I desperately wish we could collectively slow down, take things step by step, and think hard about the monumental questions we’re faced with before scaling up models further. I don’t think I’ll get my way on that — at least, not entirely.
But the madcap pressure cooker situation I’m picturing is still importantly different from a situation where one day a malevolent god might spring forth from our server farms to eat the world, and in the meantime a group of a couple-hundred-odd humans have to figure out how to prevent that with no help and no ability to observe any weaker predecessors.
I’ll use “we” and “us” as a convenient shorthand, but I’m not talking about a high-minded notion of training AI to be “aligned to all of humanity” here. When I talk about alignment research, I’m talking about training AI systems that want to do what their designer wants them to do; this might or might not be good for humanity or the world at large. More on that here. ↩︎
For example, it’s a prominent element of the plan that Jan Leike, alignment lead at OpenAI, outlines here. ↩︎
Really, we could reward models for pointing out issues with their own code too. E.g., we could simply run the model once to produce some code, then “reset” it and run it again, this time showing it the code it wrote previously and asking it to point out flaws. (In the second run, it wouldn’t have the “memory” of having written that code previously.) But for narrative convenience, I’ll just talk about “different” models pointing out issues with each other’s work. ↩︎
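The “reset” here is just statelessness: each call to the model starts fresh, with context coming only from the prompt. A toy sketch with a deterministic stub standing in for a real model API (everything here is illustrative):

```python
def model(prompt):
    # Deterministic stub standing in for a stateless model call: nothing
    # carries over between invocations except what's in the prompt.
    if "point out flaws" in prompt:
        # Second run: the model sees its earlier code only because we
        # pasted it into the prompt, not because it "remembers" writing it.
        return "flaw: returns x instead of s"
    return (
        "def total(xs):\n"
        "    s = 0\n"
        "    for x in xs:\n"
        "        s += x\n"
        "    return x\n"  # the bug the second run catches
    )

code = model("Write a function that sums a list.")          # first run
critique = model("point out flaws in this code:\n" + code)  # fresh run
```

The second call critiques the first call’s output with no memory of having produced it — which is all the “reset” amounts to.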
Maybe with the aid of other models trying to point out subtle flaws and errors. ↩︎
Or robust enough coordination strategies, such as a strong international governance regime that effectively prevents any actor from training AI systems that are too powerful to be aligned with current techniques. ↩︎