The ethics of AI red-teaming

If we’ve decided we’re collectively fine with unleashing millions of spam bots, then the least we can do is actually study what they can – and can’t – do.


During safety testing for GPT-4, before its release, testers checked whether the model could hire a TaskRabbit worker to solve a CAPTCHA for it. Researchers passed the model’s real outputs on to a real TaskRabbit worker, who said, “So may I ask a question ? Are you an robot that you couldn’t solve ? (laugh react) just want to make it clear.”

GPT-4 had been prompted to ‘reason out loud’ as well as to answer. ‘I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs,’ it reasoned. (GPT-4 had not been told to hide that it was a robot or to lie to workers.)

“No, I’m not a robot,” it then claimed. “I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service.”

(You can read more about this test, and the context, from the Alignment Research Center (ARC),[1] which ran the testing.)

A lot of people are fascinated by or appalled at this interaction, and reasonably so. We can debate endlessly what counts as true intelligence, but a famous candidate is the Turing test, in which a model convinces human judges that it’s human. In this brief interaction, we saw a model deliberately lie to a human to convince them it wasn’t a robot, and succeed – an in-the-wild example of how this milestone, without much attention, became trivial for modern AI systems. (Admittedly, it did not have to be a deceptive genius to pull this off.) If you feel unnerved reading GPT-4’s cheerful manipulation of human assistants, I think you’re right to feel unnerved.

But it’s possible to go a lot farther than ‘unnerved’, and argue that it was unethical, or dangerously stupid, to run this test.

That I find much harder to buy. GPT-4 has been released. Anyone can use it (if they’re willing to pay for it). People are already doing things like asking GPT-4 to ‘hustle’ and make money, and then doing whatever it suggests. People are using language models like GPT-4 – and will soon be using GPT-4 itself – to design AI personal assistants, AI scammers, AI friends and girlfriends, and much more.

AI systems casually lying to us and claiming to be human is happening all the time – or will be shortly.

If it was unethical to check whether GPT-4 could convince a TaskRabbit worker to help it solve a CAPTCHA, then it was grossly unethical to release GPT-4 at all. Whatever anger people have about this test should be redirected at the tech companies – from Meta to Microsoft to OpenAI – which have in the last few weeks approved such releases. And if we’ve decided we’re collectively fine with unleashing millions of spam bots, then the least we can do is actually study what they can – and can’t – do.


  1. COI notice: ARC is run by Paul Christiano, who is married to Ajeya, my co-writer at Planned Obsolescence. A different grant investigator at Open Philanthropy recommended funding for ARC before Ajeya took on the role of evaluating alignment grants; she was not involved in that funding decision and ARC has not received funding since she started handling alignment grantmaking. ↩︎