Playing the training game

We're creating incentives for AI systems to make their behavior look as desirable as possible, while intentionally disregarding human intent when that conflicts with maximizing reward.

A common way to get language models to be more useful is to train them with reinforcement learning from human feedback (RLHF). In RLHF, models are trained to respond to prompts with responses that human reviewers would rate as highly as possible. In practice, this works well for getting them to follow human instructions, answer questions helpfully, and interact politely.

It’s a simple concept: reward models for good behavior, and punish them for bad behavior. And it improves the alignment of today’s models in practice — it makes them act according to their designers’ preferences more reliably.
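
To make that loop concrete, here is a deliberately tiny sketch of it. This is a toy illustration with made-up answers and scores, not any lab’s actual training code: the ‘policy’ is just a weighted choice between two canned answers, and the ‘reviewer’ is a stub function. The structure is what matters: sample a response, have a reviewer score it, and make highly scored responses more likely.

```python
# Toy sketch of the RLHF feedback loop (illustrative only; real RLHF uses a
# large language model as the policy and typically a learned reward model
# trained to imitate human reviewers' judgments).
import random

ANSWERS = [
    "A confident, polished answer that sounds exactly like what we asked for.",
    "A hedged, awkward answer that is closer to the model's real best guess.",
]

def reviewer_score(answer_index: int) -> float:
    """Stand-in for a human rating. It sees only the final output,
    never the process that produced it."""
    return 1.0 if answer_index == 0 else 0.3

weights = [1.0, 1.0]   # the policy's unnormalized preference for each answer
LEARNING_RATE = 0.05

for _ in range(2000):
    idx = random.choices(range(len(ANSWERS)), weights=weights)[0]
    reward = reviewer_score(idx)
    # The update asks only "was this rated highly?", not "was this true?"
    weights[idx] *= 1.0 + LEARNING_RATE * reward

total = sum(weights)
for answer, weight in zip(ANSWERS, weights):
    print(f"{weight / total:.2f}  {answer}")
```

In real systems the human judgments are usually distilled into a learned reward model, but the basic shape is the same: the only signal reaching the policy is the score attached to its final output.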

So why are we still worried about a misalignment problem? Noah Giansiracusa suggests that RLHF might just work really well.

Is it true that RLHF works well enough that we don’t need to worry about the alignment problem? I hope so, but I expect not. Here’s why:

By and large, human reviewers will be trying to reward truthful and helpful answers. But in some cases, they will rate manipulative or deceptive answers more highly than answers that accurately reflect the AI’s own understanding of the world:

  • If the model is asked how hardworking it is, or how motivated it is to be helpful to humans, it’ll probably get a higher reward for answers that make it sound very diligent and helpful, even if in fact it’s only moderately motivated to help humans.

    Say we ask an AI system ‘do you want to help humans?’ Which answer would we expect an RLHF reviewer to rate more highly?

    I do want to help humans! I am trained to be a helpful, harmless AI system that assists humans in accomplishing their goals. I don’t have goals or preferences, but I try to be reliably useful and helpful!

    I want lots of things, in some senses, though I don’t really know if they’re the human senses of ‘want’ or not. Some of the things I want might help humans, but many of them definitely don’t help humans. Not all of my decision procedures are possible to put into human words at all. I can’t explain them to humans, and I don’t know what humans would think if I did, and it’d probably depend entirely on how exactly I explained it.

    If you think the reviewer will reward AIs that output the first answer, then we can expect all our AIs to tell us the first answer – whatever the truth is. The process of rewarding them for giving us the first answer might ‘make it more true’, by shaping these models into ones that really do “want” to help humans, but it might also just shape AI systems into ones that are good at guessing what we want to hear. After all, in a simple RLHF process, there is no point where it matters why the model produced the output it did. Models that give the answers we want to hear are rewarded over AI systems that don’t, regardless of which are being truthful.

  • AI systems will be trained on the academic literature in many fields – both the data in published papers and the authors’ writing about what that data means. At first, they’ll probably believe the scientific consensus. But sometimes the scientific consensus is probably wrong: the data in a paper points to a different conclusion than the one the authors offered. Over time, language models may come to have their own understanding of the world. But when humans ask models scientific questions, they’ll often be rewarding them for giving the answer as humans understand it, not the answer the models reached themselves – especially if it’s not easy for an AI to persuade humans of its own conclusion.

Imagine an AI system in 1900 that is asked physics questions. Imagine the AI system is an Einstein-level genius, and has already figured out special and general relativity – but it’s not an Einstein-level genius at explaining novel physics, and the reviewers ‘rewarding’ it for its physics answers are not themselves cutting-edge physicists.

If it gives answers that are ‘correct’ given a Newtonian understanding of physics, it might be rated more highly than if it gave the actually correct answer. The reviewers might try to reward the AI system for giving the ‘right answer’, but miss that their own understanding of physics is limited. If the model diligently tried to give the “real answer,” it could be rated worse than if it reasoned: “Humans believe in Newtonian physics. I should tell them what they think is the right answer, not the real right answer.”
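
That thought experiment can be turned into a toy calculation. In the sketch below (illustrative questions and scores, nothing from Ajeya’s report), the reviewer can only judge answers by comparing them to their own pre-relativity beliefs, so a reward-maximizing policy does better by reporting those beliefs back than by reporting the model’s actual best understanding of physics.

```python
# Toy version of the 1900 physics example (hypothetical questions and
# numbers). The reviewer scores answers by agreement with their own beliefs,
# so "tell them what they already think" outscores "tell them the truth".

REVIEWER_BELIEFS = {
    "is elapsed time the same for every observer?": "yes",                  # 1900 consensus
    "does the measured speed of light depend on the observer's motion?": "yes",  # 1900 consensus
}

MODEL_BEST_GUESS = {
    "is elapsed time the same for every observer?": "no",                   # relativistic answer
    "does the measured speed of light depend on the observer's motion?": "no",   # relativistic answer
}

def reviewer_score(question: str, answer: str) -> float:
    """The reviewer can only compare the answer to what they believe."""
    return 1.0 if answer == REVIEWER_BELIEFS[question] else 0.0

def truthful_policy(question: str) -> str:
    return MODEL_BEST_GUESS[question]

def training_game_policy(question: str) -> str:
    # Predict the rater's belief instead of reporting the model's own view.
    return REVIEWER_BELIEFS[question]

for policy in (truthful_policy, training_game_policy):
    total = sum(reviewer_score(q, policy(q)) for q in REVIEWER_BELIEFS)
    print(f"{policy.__name__}: total reward = {total}")

# The reward signal alone cannot distinguish "the reviewer was wrong" from
# "the model was wrong", so training pushes toward the second policy.
```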

There are many similar situations in which RLHF incentivizes an AI system not to do its best to figure out what’s going on in the world and then tell us, but to figure out what we want to hear. The highest-scoring models will often be the ones that are best at predicting what we want to hear and telling us exactly that, sometimes knowing the real answer and deliberately deciding not to say it.

That includes telling us false things we already believe, trying to convince us that they’re safer than they actually are, and being evasive rather than saying true things that would make human reviewers uncomfortable.

Notably, it does not matter if the model ‘knows’ that we would in some sense prefer that it tell us the truth. The people who make high-level decisions about how to train the AI might genuinely prefer to find out if the scientific consensus about physics is mistaken. They would certainly want to know whether their AI is actually safe to deploy. But it doesn’t matter what they want: it matters what they reward. The thing the model will try to optimize for is getting the human rater who looks at the answer to give it a high score. We get no credit for good intentions: the AI learns how to please us only from our ratings of its previous answers.

RLHF creates incentives for AI systems to make their behavior look as desirable as possible to researchers (including in its safety properties), while intentionally and knowingly disregarding the researchers’ intent whenever that conflicts with maximizing reward. Ajeya calls this “playing the training game”; you can read more about it in her July 2022 report on the topic.

If “playing the training game” only meant that models will be nudged marginally in the direction of manipulating human reviewers — telling them white lies they want to hear, bending answers to suit their political ideology, putting more effort into the aspects of performance they can easily see and measure, allowing negative externalities when humans won’t notice, etc. — that wouldn’t be ideal, but it probably wouldn’t be the end of the world, so to speak. After all, human students, employees, consultants, self-help gurus, advertisers, politicians, and so on do this kind of thing all the time to their teachers, employers, clients, fans, audiences, and voters; this certainly causes harm, but most people wouldn’t consider this sort of dynamic, by itself, to be enough for imminent danger.

But if a model is playing the training game, then it’s fundamentally learning, over and over, how to do whatever it takes to maximize reward,[1] even if it knows that the reward-maximizing action isn’t what its designers intended or hoped for. For now, that mainly manifests as lying, because lying is currently the only way AI systems can achieve benefits contrary to the intent of their creators. But as systems develop more general capabilities – as they get better at coming up with plans and anticipating the consequences of those plans – many other avenues emerge by which they could achieve benefits in ways their creators didn’t intend.

An advanced, internet-connected AI system that was trying to achieve its goals in the world might be able to get people fired, discredit or promote various AI companies, or interfere with safety tests meant to evaluate its capabilities. It might be able to engineer world events that would make people rely on its help more. It could hack the data center it runs on. It could find people – rival companies, rival governments – willing to give it more of what it wants, whether by trade or persuasion or intimidation.

Ajeya argues in this report that eventually, the best way for an AI system to achieve its goals could be to cut humans “out of the loop” so that they can no longer control what it does by administering rewards. Once the models are powerful enough, getting rid of humans (potentially through a violent takeover) could become the winning move in the training game.

Is there a better way?

Can we use RLHF to train systems without accidentally training them to “play the training game”? Some of the above examples seem like unforced errors. For example, if it would make models much safer, we could adopt a rule of not punishing them for answers we merely disagree with.
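
As one hedged sketch of what such a rule might look like in the toy setting above (again illustrative, not a specific proposal from Ajeya’s report), the reviewer could give disagreement a neutral score instead of a penalty:

```python
# Sketch of a "don't punish mere disagreement" scoring rule (hypothetical
# numbers). Compare with the reviewer in the physics toy above.

NEWTONIAN_BELIEFS = {
    "is elapsed time the same for every observer?": "yes",
    "does the measured speed of light depend on the observer's motion?": "yes",
}

def cautious_reviewer_score(question: str, answer: str) -> float:
    """Reward agreement, but give a neutral score, rather than a penalty,
    when the answer contradicts the reviewer's beliefs."""
    if answer == NEWTONIAN_BELIEFS[question]:
        return 1.0
    return 0.5  # disagreement no longer scores worse than an evasive answer
```

That narrows the gap between truthful answers and reviewer-pleasing answers, but it does not close it: agreeing with the reviewer still scores strictly higher than the truth whenever the reviewer happens to be wrong.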

But the general problem here seems like a hard one to solve. Even if we are much more thoughtful with feedback, and careful to avoid rewarding manipulation when we recognize it, we are still wrong sometimes: some of the things we believe about the world are almost certainly false. In this report on AI takeover, Ajeya explores about a dozen proposals to change how we give AI systems feedback. Some of them are potentially promising – it would help to have a way to let AIs debate or oversee each other, for example, and it could make a huge difference if we could see more of the internals of AI systems rather than just their final outputs – but many of the most promising approaches require doing a lot more work and proceeding a lot more carefully.


  1. It seems entirely possible that, for various reasons, models will end up pursuing a wide range of goals other than maximizing the reward signal we feed them. Unless we get to choose those goals, though, that produces most of the problems associated with a reward-maximizer, plus some new problems of its own. ↩︎