<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Planned Obsolescence]]></title><description><![CDATA[Thinking ahead to a future where AI decides everything]]></description><link>https://www.planned-obsolescence.org</link><image><url>https://substackcdn.com/image/fetch/$s_!oyGI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5519227-580c-4379-999d-03eabb6ed120_1200x1200.png</url><title>Planned Obsolescence</title><link>https://www.planned-obsolescence.org</link></image><generator>Substack</generator><lastBuildDate>Fri, 08 May 2026 23:56:03 GMT</lastBuildDate><atom:link href="https://www.planned-obsolescence.org/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ajeya Cotra]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[plannedobs@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[plannedobs@substack.com]]></itunes:email><itunes:name><![CDATA[Ajeya Cotra]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ajeya Cotra]]></itunes:author><googleplay:owner><![CDATA[plannedobs@substack.com]]></googleplay:owner><googleplay:email><![CDATA[plannedobs@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ajeya Cotra]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Science and speculation]]></title><description><![CDATA[Will we agree about AI risks in time?]]></description><link>https://www.planned-obsolescence.org/p/science-and-speculation</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/science-and-speculation</guid><dc:creator><![CDATA[Ajeya Cotra]]></dc:creator><pubDate>Fri, 01 May 2026 14:02:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4357f4a8-2583-496e-89f4-1f2abae6e659_1731x909.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Science is a social institution like the judicial system. Just like courts make you throw out perfectly good Bayesian evidence like whether the defendant seems like a shifty guy to you or rumors you heard about his whereabouts,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> the scientific process has epistemically conservative norms about what kinds of evidence and arguments are admissible. This means that the process of science gets to the truth <em>much </em>slower than a perfectly rational Bayesian should, and indeed much slower than many particular farseeing individuals actually did. Svante Arrhenius made the first estimate of global warming from human carbon dioxide emissions back in <em>1896</em>, probably about eighty years before something like a &#8220;scientific consensus&#8221; emerged and ninety four years before the first IPCC Assessment Report.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>But they say if you want to go fast go alone, if you want to go far go together. 
At its best, the point of the scientific process is to add bricks to a solid foundation of facts that we can establish &#8220;beyond a reasonable doubt,&#8221; facts we can all agree on regardless of our priors because the weight of the evidence we&#8217;ve built up is enough to overwhelm those differences. This is a good piece of social technology to have. We&#8217;re never going to agree on subtle questions of priors, and it&#8217;s great to have a process that lets us at least agree about where the giant likelihood ratios lie. And like our adversarial justice system, scientific institutions are to a significant extent designed around the fact that we all have biases and conflicts of interest &#8212; personal virtue isn&#8217;t as verifiable or scalable as specific norms like preregistration, reproducibility, and peer review.</p><p>This is why, when AI Snake Oil (now known as <a href="https://www.normaltech.ai/">AI As a Normal Technology</a>) wrote <em><a href="https://www.normaltech.ai/p/ai-existential-risk-probabilities?utm_source=publication-search">AI existential risk probabilities are too unreliable to inform policy</a> </em>two years ago, I was much more sympathetic to it than most people who are as concerned as I am about existential risks from AI. While I strongly disagree with the authors on the object level, my guess is that loosening the norms around how much policymaking should be based on science is much more likely to lead to vaccine bans and other destructive crackpottery than to thoughtfully crafted AI safety regulation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>The arguments about superintelligence and the Singularity are speculative. &#8220;Speculative&#8221; should not be a pejorative term meaning &#8220;obviously wrong and dismissible&#8221; (Arrhenius&#8217; arguments about CO<sub>2</sub> emissions were speculative), but it does make sense that society draws a normative distinction between speculative and scientifically established claims.</p><p>Unfortunately, these conservative norms have a serious chance of getting us killed in the case of AI. The pace of change in the field is vastly outstripping our ability to develop a well-grounded scientific picture of it. Just last year, METR generated a lot of buzz with an <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/">RCT</a> showing that <em>early </em>2025 AI tools actually seemed to make open source developers a little <em>slower </em>at their work tasks. By the time my colleagues started looking at <em>late </em>2025 AI tools, the <a href="https://metr.org/blog/2026-02-24-uplift-update/">experiment design broke</a>: participants started dropping out of the study or changing what they worked on because they were so afraid of being randomized into the no-AI arm for their normal tasks.</p><p>Still, the project of scientific research on models&#8217; capabilities and risks doesn&#8217;t seem doomed. We could have lived in a world where superintelligence truly came as a bolt from the blue, like a lot of the speculative writing from ten or fifteen years ago contemplated. The world we find ourselves in is extremely far from that. We used to debate questions like whether people would make AI agents at all, or whether they would be connected to the internet, or whether they would be allowed to run on their own without a human in the loop. We know the answers to all those questions now. 
We have seen thousands of public examples of AI agents lying, cheating, and working around safeguards to try to achieve their tasks. With every new model, there are new researchers who take catastrophic risks seriously. And precisely <em>because </em>we are so far from a bulletproof consensus on anything and the field moves at such a breakneck pace, there is so much grist for research and testing and sensemaking. The scientific fruit is on the floor.</p><p>We almost certainly won&#8217;t be able to develop an evidence base about AI risks anywhere near as robust as what we have for greenhouse gases causing climate change or tobacco causing lung cancer. But science is a process made out of humans, and a &#8220;scientific consensus&#8221; is ultimately a set of facts the relevant scientists managed to agree about. This develops through some messy combination of building up the foundation of established facts through shared experience and solid experiments, making logical arguments on top of established facts, arguing with one another to establish the right frames to think about things, observing whose predictions turn out to be more right, and much more. I think we do have a shot at building a real-life scientific consensus robust enough to motivate serious technical standards before it&#8217;s too late. AI is going to keep improving at a blistering pace, and in some ways that makes this an <em>easier</em> problem than tobacco or climate change. Scientific consensus, like everything else, could be dragged along at the speed of AI.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p> See <a href="https://chatgpt.com/s/t_69f432fbbc8081918c506106326a25aa">ChatGPT on hearsay</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p> See <a href="https://chatgpt.com/share/69f43598-eb20-83e8-94c9-3d1900a42a03">ChatGPT on the history of climate change</a>, especially the table at the end.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p> The same goes for courts &#8212; there are many cases where courts make the wrong object-level judgment, including many where the conservative epistemic norms of the system are directly responsible for the error. 
But the alternative to a conservative, adversarial judicial system is probably not a speedy and reliable Bayesian justice system but rather more latitude for the state to punish people who are politically inconvenient.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Six milestones for AI automation]]></title><description><![CDATA[What can AI do on its own, and how well?]]></description><link>https://www.planned-obsolescence.org/p/six-milestones-for-ai-automation</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/six-milestones-for-ai-automation</guid><dc:creator><![CDATA[Ajeya Cotra]]></dc:creator><pubDate>Fri, 03 Apr 2026 14:03:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/262daf43-96c0-4a82-9272-781db7a79187_2068x1146.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>The closer you think you are to your &#8220;powerful AI&#8221; milestone of choice, the more important it becomes to define it precisely. If your timelines are in the few-year range, changing the particular definition of powerful AI can easily double or triple your median forecast. Take my definition of &#8220;full automation of AI R&amp;D&#8221; from my <a href="https://www.planned-obsolescence.org/p/ai-predictions-for-2026">predictions post</a> earlier this year:</p><blockquote><p><em>[F]or whichever AI company is in the lead [at the relevant point in time], if you fired all its members of technical staff,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> its rate of technical progress on relevant benchmarks would be slowed down by less than 25% (this is a pretty arbitrary threshold, you can make the milestone more or less extreme by choosing a smaller or bigger number).</em></p></blockquote><p>There&#8217;s a lot that could be made more precise about this definition, but let&#8217;s zoom in on that arbitrary quantitative threshold.</p><p>In every major sector of economic activity, the productivity hit you take from removing humans is and has always been 100%. If suddenly no human could do any of the work people currently do in the agriculture sector, our plows and seed drills and combine harvesters wouldn&#8217;t till and plant and tend and harvest entirely on their own. We have many more machines to help us than we did in 1000 AD, but if we magically prevented all humans from doing any farming work, all our machines would just sit there and we would starve just as surely as our ancestors would have.</p><p>One salient milestone is the very first time the hit from removing humans is smaller than 100% in a given sector &#8212; the first time that machines can<em> just barely </em>produce output in that sector, painstakingly limping along by themselves without any humans to operate them. Let&#8217;s call this milestone <strong>adequacy</strong>. 
At this point, removing humans would still result in a catastrophic hit to output, but it&#8217;s nonetheless an unprecedented leap that unaided machines can make forward progress at <em>some </em>non-zero pace.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/ede410c3-b80e-4ef3-832b-5c59ef49b8ec_1760x1040.png" alt=""></figure></div><p>Note that the threshold of adequacy is defined in terms of the productivity hit you take from<em> removing humans</em>, which is unrelated to the productivity hit you take from <em>removing the machines. </em>These do not have to add up to 100% because humans and machines are complements. Our tractors and weeders and tillers add a huge amount to agricultural productivity, and our yields would probably plummet by well over 90% if we had to give them all up and farm like medieval peasants. 
But they are still not <em>adequate </em>at farming autonomously, because getting rid of the humans would cut yields by 100%.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/c5a9dacd-ef1b-4c4f-9322-797574e71595_1760x1040.png" alt=""></figure></div><p>While AI agents are clearly adding positively to productivity at AI companies, I think they are not yet at the adequacy milestone for AI R&amp;D. If all humans were magically prevented from doing any AI research or engineering work, and we replaced them all with giant teams of AI agents, I think the agents would likely get catastrophically stuck, and R&amp;D progress would grind to a halt within weeks.</p><p>But this is not entirely obvious! We have AI coding agents that can stay running for a long time, that can spin up subagents and give them instructions, that can write notes to other agents&#8230;we may already have all the ingredients in place. Maybe tens of thousands of these agents churning together writing huge piles of horrendous slop code would eventually, over years, advance the frontier of AI research a bit.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> Who knows? We won&#8217;t easily be able to tell when we first reach adequacy, because no one would bother trying to hand off research to AIs at the first point when they might barely be adequate at it.</p><p>Once AIs reach adequacy in a certain sector, they will keep improving. The next interesting milestone is <strong>parity </strong>&#8212; the first point when getting rid of the AIs slows down progress in the sector <em>more</em> than getting rid of all the humans. The first time parity is reached, it&#8217;s likely that humans will still be better than AIs at many important subtasks, so getting rid of them and forcing the AIs to do those tasks instead would result in a substantial hit to productivity. 
It&#8217;s just that human labor <em>as a whole </em>is less important for productivity than AI labor.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/738fa2cc-7e28-4a78-9cbf-0e124c0aac3f_1760x1040.png" alt=""></figure></div><p>When I defined &#8220;full automation of AI R&amp;D&#8221; as the point when productivity drops by 25% or less from removing humans, I was gesturing at AI systems that are somewhat better than parity in the AI R&amp;D sector.</p><p>Beyond parity, we can talk about <strong>supremacy </strong>&#8212; the first point when productivity in a given sector would actually <em>increase </em>from removing humans. In other words, a major sector of the economy would become like chess. If for some reason you wanted to win as many chess games as possible per dollar, unaided computers would crush human-computer teams right now. A human-AI team <em>can</em> technically draw against an AI opponent, if the human defers to their AI teammate about every move. But if you care at all about resource constraints,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> the human is pure deadweight. 
Supremacy is the point where it&#8217;s no longer worth the overhead to deal with humans or the cost of the food they need to stay alive.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/18bbac7d-b155-4bda-aa2d-45a5c54440b7_1760x1040.png" alt=""></figure></div><p>In theory, we can talk separately about the entire spectrum from adequacy through parity to supremacy for any sector, from agriculture to pharmaceuticals to finance. But two domains are likely to set the pace for all the rest: AI research and AI production (that is, the entire stack of chips, fabs, fab equipment manufacturers, power plants, and everything else that goes into producing and running AI systems). If we cross the three automation thresholds with these two domains, we get six milestones.</p>
<p>Here are my current best guesses for when we cross these milestones:</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/870339da-3a3a-449d-b730-294dfb4495c1_1240x414.png" alt=""></figure></div><p>We seem very close to <em>AI research adequacy</em> already. As I said, I wouldn&#8217;t be totally shocked if you told me this milestone had already passed, and I expect it to happen in the next couple of years. From there, it seems like another couple of years could take us to <em>AI research parity</em>. When we&#8217;re at <em>AI research parity</em>, we&#8217;ll effectively have millions of researchers working on advancing AI, instead of thousands as we do today. I think this would massively accelerate the pace of AI progress, leading to <em>AI research supremacy </em>within another year.</p><p>The AIs that come out at the end of this process would likely be able to pick up new skills with orders of magnitude less data than we need to provide systems today, including the skills required to operate physical actuators like robots and the skills involved in mining, electrical engineering, semiconductor manufacturing, and everything else involved in maintaining the AI stack. From there, reaching <em>AI production adequacy </em>and then <em>parity </em>and <em>supremacy </em>seems like it would mainly be a matter of rolling out AI to enough places and manufacturing enough actuators for it to operate.</p><p>Because <em>AI production supremacy </em>requires AIs and robots to master an extremely diverse range of physical and scientific skills, I think it also entails supremacy on practically any objectively measurable physical contest. For any goal defined in terms of outcomes in the physical world, where you could explain it to an alien and they could adjudicate whether it had happened &#8212; sending something to space, blowing something up, generating some energy, mining some ore, terraforming a planet &#8212; humans would be worse than useless at contributing to that outcome.</p><p>This is well past the point of barely <a href="https://www.planned-obsolescence.org/p/self-sufficient-ai">self-sufficient AI</a>. 
At this point, every country on Earth is wholly reliant on its entirely automated military, and it would be trivial for AI systems to take over the world if they wanted to.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The obvious legalese applies to that conditional clause: the company can&#8217;t rehire any technical staff and none of its other staff can ever switch into technical roles. The company can still have human legal staff and ops staff and custodial staff, but no human ever gets to touch anything related to the research or the codebase anymore &#8212; including the &#8220;soft&#8221; stuff like setting research direction, project management and supervision, or synthesizing varied pieces of evidence to build a holistic scientific picture of a big research question.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>When I wrote my post about <a href="https://www.planned-obsolescence.org/p/self-sufficient-ai">self-sufficient AI</a>, I was vague about exactly what it means for the AI to be self-sufficient &#8212; just how <em>well </em>do they have to be able to sustain themselves? I think it&#8217;s most naturally interpreted as an adequacy threshold. As <a href="https://x.com/snewmanpv/status/2009008644973842713">Steve Newman points out</a>, there&#8217;s a gritty sci-fi story like <em>The Martian</em> in there about whether the scrappy band of prototype robots can bootstrap production before they all break down.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>That is, if humans keep maintaining the AIs&#8217; physical existence; the lines are a bit blurry for AI R&amp;D, or any sector other than the entire AI production stack.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In clean economic models, you&#8217;ll find that humans are always able to contribute <em>something </em>because of the law of comparative advantage. But this only applies because those models assume that land is infinite (so humans can always go to marginal land and subsistence farm there) and that trading overheads are zero. 
Both of those are untrue in the real world, so it&#8217;s likely that AI systems will eventually not bother to trade with humans.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Note: it&#8217;s certainly possible that humans have been killed <em>before</em> this point; to keep the milestones clean, I didn&#8217;t include this conditional, but supremacy on everything comes <em>after </em><a href="https://www.planned-obsolescence.org/p/self-sufficient-ai">self-sufficient AI</a>, the point when AI systems could survive without humans around.</p></div></div>]]></content:encoded></item><item><title><![CDATA[I underestimated AI capabilities (again)]]></title><description><![CDATA[Revisiting a prediction ten months early]]></description><link>https://www.planned-obsolescence.org/p/i-underestimated-ai-capabilities</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/i-underestimated-ai-capabilities</guid><dc:creator><![CDATA[Ajeya Cotra]]></dc:creator><pubDate>Thu, 05 Mar 2026 15:02:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/690d53e2-662e-48fc-8ea8-00c3492b5bcf_1984x1100.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>On Jan 14th, I <a href="https://www.planned-obsolescence.org/p/ai-predictions-for-2026">made predictions</a> about AI progress in 2026. My forecasts for software engineering capabilities already feel much too conservative.</p><p>In my view, <a href="https://metr.org/">METR</a> (where I now work) has some of the hardest and highest-quality software engineering and ML engineering benchmarks out there, and the most useful framework for making benchmark performance intuitive: we measure a task&#8217;s difficulty by the amount of time a human expert would take to complete it (called the <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">&#8220;time horizon&#8221;</a>).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>When I made my forecasts last month, the model with the longest measured time horizon on METR&#8217;s suite of software engineering tasks was Claude Opus 4.5; it could succeed around half the time at software tasks that would take a human software engineer about five hours.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> Time horizons on software tasks had been doubling a little less than twice a year from 2019 through 2025, which would have implied the state-of-the-art 50% time horizon should be somewhat less than 20 hours by the end of 2026.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> But there was ambiguity about whether the more recent doubling time was faster than the long-run trend, so I bumped that up to 24 hours for my median guess.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> My 20th percentile was around 15 hours and my 80th percentile was around 40 hours.</p>
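<p>For concreteness, the arithmetic behind these extrapolations is just exponential growth in the time horizon. Here is a minimal sketch in Python; the doubling times are the ones from the paper discussed in the footnotes (212 days for the 2019-2025 trend, 118 days for 2024-2025), while the function name and the rounded five-hour starting point are illustrative:</p><pre><code>def extrapolated_horizon(h0_hours, doubling_time_years, years_elapsed):
    # Project a 50% time horizon forward along an exponential trend.
    return h0_hours * 2 ** (years_elapsed / doubling_time_years)

# ~5-hour horizon at the start of 2026, extrapolated one year out:
print(extrapolated_horizon(5, 212 / 365, 1.0))  # 2019-2025 trend: ~16.5 hours
print(extrapolated_horizon(5, 118 / 365, 1.0))  # 2024-2025 trend: ~43 hours
</code></pre>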
id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> I don&#8217;t take the specific number literally &#8212; there are many fewer very-long tasks than medium and short tasks, and the long tasks more often have guesstimated (rather than measured) human completion times, so time horizon estimates for the latest models are a lot noisier than they were in 2025. And the benchmark underlying the time horizon graph is nearly saturated, which causes the confidence intervals to blow up: the 95% CI is 5.3 hours to 66 hours.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> It&#8217;s really hard to discriminate between different capability levels at the current range.</p><p>But at the end of the day, that dataset had 19 software engineering tasks estimated<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> to take humans longer than 8 hours, and Opus 4.6 was able to solve 14 of them at least some of the time (and it reliably nailed four of them).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> And beyond just this one task suite, we&#8217;ve seen examples of AI agents doing certain very well-specified software tasks like writing <a href="https://cursor.com/docs/agent/browser">a browser</a> or <a href="https://www.anthropic.com/engineering/building-c-compiler">C compiler</a>, or <a href="https://garryslist.org/posts/ai-just-ported-simcity-in-4-days-without-reading-the-code">porting a giant game</a>, that would take humans many weeks or months to do on their own &#8212; not perfectly, but better than most people expected and better than a naive reading of the agents&#8217; measured time horizon would have suggested.</p><p>And this happened in February. It&#8217;s no longer very plausible that after ten whole months of additional progress at the recent blistering pace,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> AI agents would still struggle half the time at 24 hour tasks.</p><p>I wish them the best, but I think my colleagues on the capability evaluations team at METR might struggle to create new software tasks from a similar distribution capable of measuring AI agents&#8217; true time horizons through the end of the year. If we could measure this, I&#8217;d guess that by the end of the year, AI agents will have a time horizon of over 100 hours on the sorts of software tasks in METR&#8217;s suite (which are not highly precisely specified &#8212; on certain extremely well-specified software tasks like the examples above, agents seem to <em>already </em>have a time horizon of more than a hundred hours).</p><p>And once you&#8217;re talking about multiple full-time-equivalent <em>weeks</em> of work, I wonder if the whole concept of &#8220;time horizon&#8221; starts to break down.</p><p>It&#8217;s nearly impossible to subdivide a typical one hour task (e.g., debugging one failing test) into smaller pieces that multiple people can work on in parallel. 
It wouldn&#8217;t go very well if you had to farm out writing this print statement or reading that error message or tweaking this line of code to different people &#8212; the right action to take next depends intimately on everything that came before it and the precise state of the code as a whole; you have to hold the whole context in your mind as you take each action, or the actions won&#8217;t cohere in the right way.</p><p>It&#8217;s somewhat easier to decompose an eight-hour task (e.g., writing a simple browser game) into smaller components, but those components are constantly bleeding into each other in ways that make clean handoffs hard. When you&#8217;re implementing the game logic, you realize it needs to know something about how the graphics are rendered. When you&#8217;re handling user input, you find yourself tweaking the game loop. The fastest way to do it is probably one person knocking it out in a day, making a hundred small decisions fluidly as they go.</p><p>But it&#8217;s actually pretty feasible to break down a <em>month</em>-long task into smaller pieces. In fact, you may start <em>benefiting </em>from some explicit decomposition &#8212; it might be helpful to write a design doc laying out how the pieces fit together, or break the work into tickets so you don&#8217;t lose track of what&#8217;s done and what&#8217;s left. And while it might take one person working alone a month to complete, the fastest way to get it done might be to have different people work on different pieces like the checkout flow or the inventory management panel in parallel.</p><p>And of course, tasks that take multiple full-time-equivalent <em>years </em>of work nearly always can and should be broken down into smaller milestones, and parallelized across multiple teammates. Human work appears to get more and more decomposable the longer and longer it gets.</p><p>In other words, very few tasks feel <em>intrinsically </em>like<em> </em>year-long tasks, the way that writing one bash command feels intrinsically like a one-second task, or debugging one simple bug feels intrinsically like a one-hour task. Maybe a mathematician banging their head against a hard conjecture for a year before finally making a breakthrough is a &#8220;real&#8221; year-long task? But most many-person-month software projects in the real world sort of feel like they might be a bunch of few-week tasks in a trenchcoat, the way that a hundred-question elementary school math test is really 100 thirty-second tasks in a trenchcoat.</p><p>If an extreme version of that is true, then once AI agents can consistently do (say) 80-hour tasks, they should be able to make continuous progress on projects of arbitrary scale. Maybe manager AIs can spend their work week figuring out how to farm out the current project goal to line-worker AIs, line-workers can execute on their individual piece, and all the AIs can maintain good enough records that no individual agent needs to build up holistic, long-term state on the whole project.</p><p>I think this <em>probably </em>won&#8217;t fully work. Even projects with lots of formalized goal-tracking still benefit a <em>lot </em>from everyone involved intuitively appreciating the bigger picture in a way that isn&#8217;t fully captured in Jira tickets and Asana tasks. 
Decomposing a 6-month project so cleanly and precisely that it can be executed by a team of people with no such holistic context might itself be, say, a 2-month task.</p><p>But it might work surprisingly well for a surprisingly large class of software projects. AI agents are a lot cheaper and a <em>lot </em>more patient than humans, so it could be practical to get them to do far, <em>far </em>more task-tracking, documentation, and other project management than human teams ever do. People have already started aggressively experimenting with scaffolding for orchestrating agent teams. It&#8217;s not clear how far it will go over the next several months.</p><p>This is why my colleague Tom <a href="https://x.com/testingham/status/2027025390578155645">proposed</a> that the calendar time it takes a large <em>team </em>of humans to do a task might be a better proxy for &#8220;intrinsic difficulty&#8221; than the time it takes one human working alone. So far, the &#8220;team time&#8221; and &#8220;solo time&#8221; have been very similar in the METR task suite, since it has ranged from 1-second tasks to maybe 20-hour tasks. But we&#8217;re entering the regime where these numbers could rapidly diverge. If Tom&#8217;s conjecture is true, the &#8220;solo time&#8221; metric should start going super-exponential about now&#8230;which makes it very hard to bound software engineering capabilities by the end of the year.</p>
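<p>One way to picture Tom&#8217;s conjecture is with a toy model: suppose work only decomposes into handoff-able chunks above some granularity, and below that granularity one worker has to do it serially. Everything here (the 80-hour chunk size, the overhead number, the function names) is a hypothetical sketch, not anything METR has measured:</p><pre><code>CHUNK_HOURS = 80  # hypothetical: below this, work is too entangled to hand off
OVERHEAD = 8      # hypothetical coordination cost per handed-off chunk, in hours

def solo_time(total_hours):
    # One worker does all the work serially.
    return total_hours

def team_time(total_hours):
    # A large team runs all decomposable chunks in parallel, so calendar
    # time plateaus near the chunk granularity plus coordination overhead.
    if total_hours > CHUNK_HOURS:
        return CHUNK_HOURS + OVERHEAD
    return total_hours

for hours in [1, 8, 80, 800, 8000]:
    print(hours, solo_time(hours), team_time(hours))

# Below the 80-hour granularity the two proxies agree; above it, solo time
# keeps growing while team time stays flat. A fixed level of "team time"
# capability then corresponds to an exploding "solo time" horizon.
</code></pre>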
Consider the &#8220;task&#8221; of taking a giant math test consisting of 10,000 easy elementary school word problems &#8212; this is obviously 10,000 thirty-second tasks, rather than one 83-hour task. Or to take an example suggested by my colleague Tom Cunningham, consider the task &#8220;Count how many times a horse is mentioned in Anna Karenina.&#8221; This might take a single human 10 hours, but it&#8217;s highly parallelizable: a team of 300 people could each take a few pages and do the task in a minute or two. In the METR task suite, &#8220;human time to complete&#8221; works as a good proxy for &#8220;intrinsic&#8221; difficulty because the tasks are constructed to be hard to easily decompose and parallelize. Within each task, every piece depends on every other, and you benefit from keeping the whole context in mind. More on that later in the post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Actually, at the time, METR was measuring models on its original time horizon suite (TH 1.0), and the precise central estimate for Opus 4.5 was ~4h48min on that distribution. But since then, METR has released an updated suite with more tasks (TH 1.1) which caused all models' scores to shift slightly; the measured time horizon for Opus 4.5 on TH 1.1 is ~5h20min.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The 2019-2025 doubling time calculated in the <a href="https://arxiv.org/pdf/2503.14499">original paper</a> was 212 days, or 0.58 years. That means that if the time horizon at the beginning of January was 5 hours, the time horizon at the end of the year should be 5 * exp(ln(2) * 1/0.58) ~= 16.4 hours. When I was doing mental math, I was approximating the 7 month doubling time as two doublings a year: 5 * 2 * 2 ~= 20 hours. The rule of thumb was a bit more aggressive than the strict extrapolation, but in fact, I just realized while writing this post that the mental extrapolation was too <em>conservative </em>in a different way &#8212; I should have dated the ~5 hour time horizon to Nov 24 (Opus 4.5&#8217;s release date), not the beginning of January, meaning I should have added an extra month to all these extrapolations.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>I didn&#8217;t actually do this math at the time, but the <a href="https://arxiv.org/pdf/2503.14499">original paper</a> calculated a doubling time of 118 days or 0.32 years since 2024, and if you assume that doubling time was correct, then the time horizon by EOY 2026 should have been 5 * exp(ln(2) * 1/0.32) ~= 43 hours. Given that I took this faster doubling time pretty seriously at the time, this suggests my median and especially my 80th percentile should have been higher to begin with. I would probably have been better served by the heuristic of using just the previous year (rather than the previous seven years) to forecast the next year.</p>
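<p>For concreteness, here&#8217;s a tiny script reproducing these doubling-time extrapolations (the 212-day and 118-day figures above, plus the ~3.5-month 2025-only figure from footnote 9 below; small differences from the quoted numbers are rounding):</p><pre><code>def extrapolate(h0_hours, doubling_time_years, years_ahead=1.0):
    """Project a time horizon forward under steady exponential doubling."""
    return h0_hours * 2 ** (years_ahead / doubling_time_years)

h0 = 5.0  # ~5 hour time horizon at the start of the year
print(extrapolate(h0, 212 / 365))  # 2019-2025 trend: ~16.5 hours
print(extrapolate(h0, 118 / 365))  # since-2024 trend: ~43 hours
print(extrapolate(h0, 3.5 / 12))   # 2025-only trend:  ~54 hours
</code></pre></div></div>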
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Originally, METR estimated Opus 4.6 to have a 50% time horizon of <a href="https://x.com/METR_Evals/status/2024923422867030027">~14.5 hours</a> on Feb 20, 2026. We <a href="https://x.com/METR_Evals/status/2028948235486937098">corrected a bug in our modeling</a> on March 3, 2026 and this reduced its time horizon estimate to ~12 hours. Note that this measurement was done on Time Horizon 1.1, the new task suite released on Jan 29, 2026.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This is something I appreciate about the time horizon construct. For standard benchmarks, confidence intervals around a point get <em>narrower </em>as the benchmark approaches saturation: if a model gets 50% accuracy, there&#8217;s more room for error in either direction than if it gets 95% accuracy. But since the time horizon metric could get infinitely long, the confidence intervals can be constructed so uncertainty gets <em>wider </em>rather than narrower as the current hardest tasks are saturated. Epistemically, it&#8217;s appropriate for your uncertainty about real capabilities to blow up when you no longer have tasks that the models can&#8217;t complete.</p>
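<p>(A quick illustration of the first point, assuming plain binomial error bars on a benchmark scored over a fixed set of tasks; the task count here is made up:)</p><pre><code># Standard error of an accuracy estimate from n scored tasks:
# sqrt(p * (1 - p) / n). The error bar shrinks as accuracy approaches
# 100%, i.e. as the benchmark saturates.
def binomial_se(p, n):
    return (p * (1 - p) / n) ** 0.5

n = 200
print(binomial_se(0.50, n))  # ~0.035, roughly +/- 3.5 points
print(binomial_se(0.95, n))  # ~0.015, roughly +/- 1.5 points
</code></pre></div></div>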
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>In the original time horizon <a href="https://arxiv.org/abs/2503.14499">paper</a>, METR measured human completion times for most of the tasks in the dataset by actually having humans do them (148 out of 169). In the updated suite, only 5 of the 19 tasks longer than 8 hours have measured human baselines &#8212; the others are estimated.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Agents are usually run on the same task 6 separate times. The same agent doesn&#8217;t always approach the same task the same way each time &#8212; it sometimes gets lucky or unlucky. For four of the 19 hard tasks, Claude Opus 4.6 succeeded in all six runs; for another ten, it succeeded in at least one of the six.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>If you look just at the year 2025, agent time horizons doubled every <a href="https://x.com/RyanPGreenblatt/status/2020301259904049620">~</a><em><a href="https://x.com/RyanPGreenblatt/status/2020301259904049620">3.5 months</a></em>, not every ~7 months as in the long-run trend or even every ~4 months as in the 2024-2025 trend. I mostly didn&#8217;t factor this into my forecasts; it wasn&#8217;t salient to me compared to the rule of thumb I&#8217;d absorbed of &#8220;two doublings a year.&#8221; If I had used this to extrapolate from Opus 4.5, that would have suggested time horizons by the end of the year should be 5 * exp(ln(2) * 12/3.5) ~= 54 hours.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Specifically, my operationalization was that firing all human members of technical staff (research leads, engineers, everyone) would slow down progress by less than 25%. Now and in the past, of course, firing every single human would cause progress to completely halt.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>A concrete example of this is the kind of argument that my colleague Nikola made <a href="https://nikolajurkovic.substack.com/p/how-likely-is-dangerous-ai-in-the">here</a>, in Nov 2025, when the state-of-the-art time horizon was 2 hours; this kind of argument would be much shakier made today (less than four months later!).</p></div></div>]]></content:encoded></item><item><title><![CDATA[Takeoff speeds rule everything around me]]></title><description><![CDATA[Not all short timelines are created equal]]></description><link>https://www.planned-obsolescence.org/p/takeoff-speeds-rule-everything-around</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/takeoff-speeds-rule-everything-around</guid><pubDate>Thu, 12 Feb 2026 15:03:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!y0b_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3b4898b-1c83-4e3a-beb3-212cf75ca06c_1125x675.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A decade ago, debates about AI x-risk often centered on AGI timelines. Was it over a century away, or was it <a href="https://coefficientgiving.org/research/some-background-on-our-views-regarding-advanced-artificial-intelligence/">plausibly as soon as 20 years</a>? Today, even people relatively skeptical of x-risk <a href="https://helentoner.substack.com/p/long-timelines-to-advanced-ai-have">often have crazy short timelines</a> like ten years or five years or <a href="https://x.com/deanwball/status/2001068539990696422">negative two months</a>. So what accounts for the vast remaining disagreement on the level of imminent risk and the appropriate policy stance toward it?</p><p>I think it&#8217;s secretly still mostly about timelines. These days, I suspect many people who aren&#8217;t particularly concerned about x-risk combine very short timelines to AGI<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> with very <em>long</em> timelines to AI transforming the real world &#8212; in other words, they believe <em>takeoff speed </em>will be very slow.</p><p>The classic definition of takeoff speed is the amount of time it takes to go from AGI to superintelligence,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> but both of those terms are very slippery. 
And it&#8217;s hard to articulate exactly how &#8220;super&#8221; an intelligence we&#8217;re talking about without specifying what impressive <em>feats </em>it should be able to achieve in the real world. Instead, I think it can be more illuminating to operationalize &#8220;takeoff speed&#8221; with respect to outcomes in the external world caused by AI.</p><p>One crucial yardstick for takeoff is the speed of progress in physical technology.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>Consider the average rate at which scientists and engineers have been making new discoveries or improving the efficiency of existing processes across a wide range of hard-tech fields (hardware, batteries, materials, industrial chemistry, robotics, spacecraft) over the last 20 years or so.</p><p>Now suppose that sometime in 2028, we develop AI that can automate <em>all </em>the intellectual work that all these scientists are doing across all these fields. How many &#8220;years of progress&#8221; (at the old human-driven pace) will these fields make each calendar year?</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/dfb5542a-962e-4827-90f7-8645c3c71b64_1125x675.png" alt=""></figure><p>Here&#8217;s a classic &#8220;fast takeoff&#8221; view:</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/b4ed3bd7-15b2-4549-995a-41bed3d24409_1125x675.png" alt=""></figure><p>On this view, once AI fully automates further AI R&amp;D, this will kick off a strong <a href="https://www.forethought.org/research/will-ai-r-and-d-automation-cause-a-software-intelligence-explosion">super-exponential feedback loop</a> that leads to unfathomably superhuman AI within months. While that software-based intelligence explosion is going on, you don&#8217;t really see physical technology improve all that much.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> But once we have this god-like AI, it&#8217;ll take only months or weeks or <em>days </em>to create sci-fi technologies (molecular nanotechnology, whole brain emulation, reversible computing, near-light-speed spacecraft) that would have taken centuries at human rates.</p><p>In contrast, here&#8217;s a view that would be considered a &#8220;slow takeoff&#8221;, again conditioning on full cognitive automation of science in mid-2028:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/f6a72605-060f-4855-866b-ae15ab181154_1125x675.png" alt=""></figure><p>On the &#8220;slow&#8221; takeoff view, there is <em>still </em>a super-exponential feedback loop of AI-improving-AI. It&#8217;s just that we need to <a href="https://www.forethought.org/research/the-industrial-explosion">incorporate physical automation</a> of <a href="https://www.planned-obsolescence.org/p/self-sufficient-ai">the entire AI stack</a> to get that feedback loop going, so it&#8217;s more gradual (and we actually see physical signatures <em>before </em>we get the full automated scientist). But even &#8220;slow&#8221; takeoff still means that we go from automated science to an unrecognizably sci-fi world within a matter of years.</p><p>Meanwhile, I suspect most x-risk skeptics think that AI automating scientific research won&#8217;t be that big a deal. Perhaps it makes R&amp;D go modestly faster, or perhaps AI automation is necessary just to keep us going at our previous pace, when otherwise we would have stagnated. 
They think <em>there will be no takeoff at all.</em></p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/a3b4898b-1c83-4e3a-beb3-212cf75ca06c_1125x675.png" alt=""></figure>
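<p>To make the contrast between these three pictures concrete, here&#8217;s a minimal sketch of cumulative &#8220;years of progress&#8221; under each view; the per-year multipliers are purely illustrative stand-ins, not numbers from any of the models linked above:</p><pre><code># Cumulative "years of progress" (at the old human-driven pace) in
# physical technology under three illustrative takeoff views.
def cumulative_progress(multipliers):
    total, path = 0.0, []
    for m in multipliers:  # m = years of progress per calendar year
        total += m
        path.append(total)
    return path

fast = [1, 30, 300, 1000]  # software intelligence explosion, then god-like AI
slow = [2, 5, 15, 40]      # feedback loop runs through physical automation
none = [1, 1, 1, 1]        # "normal technology": no takeoff at all

for name, mults in [("fast", fast), ("slow", slow), ("none", none)]:
    print(name, cumulative_progress(mults))
</code></pre>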
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This one parameter underlies a huge range of disagreements. Will the country that first develops AGI gain a &#8220;decisive strategic advantage&#8221; &#8212; the ability to impose its will on everyone else? Well, if a six month lead in AI translates into a <em>centuries</em>-long lead in weapons technology, probably. Could<em> </em>misaligned AI systems<em> </em>drive humanity extinct? If they can trivially develop a superplague to wipe us out and upload themselves into self-replicating nanocomputers to <a href="https://www.planned-obsolescence.org/p/self-sufficient-ai">survive without us</a>, then sure. Should we try to slow down AI? If it means that we get a year or two to absorb the impacts of a world with technology from 2100 before we have to deal with technology from 2200, then yes please.</p><p>In my view, this has always been at the heart of the disagreement, and it still is. But now that some level of powerful AI (say at least an 8 on <a href="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/oAy72fcqDHsCvLBKz/rqihhwin7ddif1kn6w9e">Nate Silver&#8217;s Richter Scale</a>) feels around the corner to doomers and skeptics alike, the deeper disagreement about just how powerful it&#8217;ll get and how quickly has been obscured. Some people think AGI is the next innovation that allows us to sustain 2% frontier economic growth a little while longer, or maybe add <a href="https://marginalrevolution.com/marginalrevolution/2025/02/why-i-think-ai-take-off-is-relatively-slow.html">a half a percentage point more</a>. Others think we are t minus a few years from first contact with an unfathomably advanced alien species, entities we would view as nothing less than gods.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Or at least, something they <em>call </em>&#8220;AGI.&#8221; As I discussed <a href="https://www.planned-obsolescence.org/p/self-sufficient-ai">last time</a>, I think the definition of AGI is often unproductively watered down. The people with the very shortest timelines (e.g. 
1 or 2 years) are disproportionately likely to be forecasting a milder version of &#8220;AGI,&#8221; which in turn exaggerates how long they believe the takeoff period will be from AGI to radical superintelligence.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This is the definition given by Nick Bostrom in his 2014 book <em><a href="https://www.amazon.com/Superintelligence-Dangers-Strategies-Nick-Bostrom/dp/0198739834">Superintelligence</a></em>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Another important real-world yardstick of takeoff is economic output, or gross world product. This is the operationalization used in <a href="https://sideways-view.com/2018/02/24/takeoff-speeds/">this</a> 2018 blog post by Paul Christiano and <a href="https://epoch.ai/gate">Epoch&#8217;s GATE model</a>, and is the central operationalization used in <a href="https://coefficientgiving.org/research/report-on-whether-ai-could-drive-explosive-economic-growth/">Tom Davidson&#8217;s 2021 takeoff speeds report</a> (though that report also models several other metrics like software and hardware; you can play with the model <a href="https://takeoffspeeds.com/">here</a>). This is also really important, but it introduces some complications that aren&#8217;t essential for the basic argument.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>If this graph was a graph of the <em>cognitive </em>capabilities of AI systems, it would be rapidly but smoothly growing over the whole period from 2026 through 2029, like the <a href="https://www.aifuturesmodel.com/">AI Futures Project&#8217;s takeoff model</a>. But I think the graph of physical technology better illustrates how the takeoff will <em>feel </em>to people outside AI companies.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>I wanted to illustrate different views about takeoff <em>conditional </em>on a fixed timeline to fully-automated science, but in reality if you have slower takeoff speeds, you probably also have longer timelines to that initial milestone. 
(And there would be room for even more acceleration prior to full automation.)</p></div></div>]]></content:encoded></item><item><title><![CDATA[AI predictions for 2026]]></title><description><![CDATA[But first, scoring my predictions for 2025]]></description><link>https://www.planned-obsolescence.org/p/ai-predictions-for-2026</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/ai-predictions-for-2026</guid><dc:creator><![CDATA[Ajeya Cotra]]></dc:creator><pubDate>Wed, 14 Jan 2026 15:03:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SlhK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ccfd02e-0a3c-4d48-90ee-1ecda468d699_1348x636.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On December 13th, 2024, I <a href="https://x.com/ajeya_cotra/status/1867813307073409333">registered predictions</a> about what would happen with AI by the end of 2025, using <a href="https://ai2025.org/">this survey</a> run by <a href="https://sage-future.org/">Sage</a>. They asked five questions about benchmarks, four about the OpenAI <a href="https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf">Preparedness Framework</a> risk categories, one about revenues, and one about public salience of AI.</p><p>The Sage team has updated the survey with the resolutions, so I went through and looked back at what I said. Overall, I was somewhat too bullish about benchmark scores and much too <em>bearish</em> about AI revenue &#8212; the <a href="https://x.com/davidshor/status/2009031520498119082">reverse</a> of the conventional wisdom that people underestimate benchmark progress and overestimate real-world utility (something which I felt like I&#8217;ve done in previous years, though I didn&#8217;t register clear predictions so it&#8217;s hard to say).</p><p>You can see how I did on benchmark scores in the table below.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> I overestimated progress on pretty much every benchmark other than <a href="https://epoch.ai/frontiermath">FrontierMath</a> Tiers 1-3,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> which notoriously jumped from ~2% to ~24% with <a href="https://www.youtube.com/watch?v=SKBG1sqdyIU&amp;t=4:05">the announcement of OpenAI&#8217;s o3</a>, about a week after I registered my prediction (and less than two months after the benchmark was published).</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/2ccfd02e-0a3c-4d48-90ee-1ecda468d699_1348x636.png" alt=""><figcaption class="image-caption"><em>* Shortly after my prediction OpenAI <a href="https://www.youtube.com/watch?v=SKBG1sqdyIU?t=2:06">claimed</a> o3 got 72%. ** Shortly after my prediction OpenAI <a href="https://www.youtube.com/watch?v=SKBG1sqdyIU&amp;t=4:05">claimed</a> o3 got 24%. *** Anthropic only reimplemented 39 out of the 40 tasks; Opus 4.5 got 32 / 39 right.</em></figcaption></figure><p>All the other questions are in the table below. I did fine on the OpenAI Preparedness Framework predictions and was somewhat too bullish on public salience. But I completely bombed the annualized revenue question, predicting that it would merely triple from ~$4.5B to ~$17B instead of nearly <em>7</em>xing (!) to ~$30.5B.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/3675c543-2471-4172-9719-1adf99b071bb_1600x1108.png" alt=""></figure><p>I&#8217;m not totally sure what&#8217;s up here,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> since I&#8217;ve heard from other sources that AI revenue has been roughly 3-4xing year on year. My best guess is the Sage resolution team pulled lowball reports for the 2024 baseline revenue; one friend who looked into it a bit said that EO2024 revenue was ~$6.4B rather than ~$4.5B as Sage reported. If true, this would make my prediction only somewhat too bearish.</p><h2>Predictions for 2026</h2><p>All these forecasts are evaluated as of 11:59 pm Pacific time on Dec 31 2026. I&#8217;ll start by giving my medians for some measurable quantities with pre-existing trends:</p><ul><li><p><strong>50% <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">METR time horizon</a>: 24 hours.</strong> Currently, Claude Opus 4.5 has the longest reported 50% time horizon on this task suite, at 4h49m &#8212; meaning that METR&#8217;s model predicts it can solve about half of the programming tasks that take a low-context human expert five hours (it&#8217;ll be able to solve a greater fraction of shorter tasks, and a smaller fraction of longer tasks; see the sketch just after this list). 
My median for the longest 50% time horizon reported as of Dec 31, 2026 is 24 hours (20th percentile 15 hours, 80th percentile is that it&#8217;s too long for METR to accurately bound in practice but probably around 40 hours in &#8220;reality&#8221;).</p><ul><li><p>Note that to be able to measure even 24 hours accurately, METR would need to make new tasks, which they&#8217;re furiously working on now; this will somewhat change the task distribution, but I&#8217;m not thinking too hard about that now.</p></li></ul></li><li><p><strong><a href="https://epoch.ai/benchmarks/eci">Epoch ECI</a>: 169. </strong>This is a simple abstract model aggregating multiple benchmarks; it currently includes 37 distinct benchmarks but it can be extended to incorporate new ones. The current top score is 154, achieved by Gemini 3 Pro, and it <a href="https://epoch.ai/data-insights/ai-capabilities-progress-has-sped-up">has recently been growing</a> at 15.5 points per year. The aggregated score is in abstract units, but <a href="https://epoch.ai/data-insights/interpreting-eci">you can back out</a> an implicit prediction about particular benchmarks from that. For example, it seems like an ECI of 169 corresponds to a score of about 85% on Frontier Math Tiers 1-3 (though I expect the actual score to be a little lower, maybe 80%, since progress will probably slow down as it approaches saturation). My median prediction for Frontier Math Tier 4, and any other benchmark included in ECI that&#8217;s far from saturation, is whatever the ECI score implies it should be.</p></li><li><p><strong>Annualized Dec AI revenue: $110 billion. </strong>The combined annualized revenue of OpenAI, Anthropic, and xAI (i.e. December revenue * 12) was $30.5 billion in December 2025, according to Sage. My median forecast of the same value for the month of Dec 2026 is $110 billion (20th percentile $60 billion, 80th percentile $300 billion). This is a ~5x increase, which is my current best guess for the actual increase from EO2024 to EO2025. <em>[Edit 2/18: I noticed that $110 billion is not 5x $30.5 billion; guessing $110 billion was a typo and I meant to say $150 billion here]</em></p></li><li><p><strong>Salience of AI as top issue: 2%. </strong>As of December 2025, 1% of people named &#8220;advancement of computers / technology&#8221; or equivalent as the most important issue in <a href="https://news.gallup.com/poll/1675/most-important-problem.aspx">Gallup&#8217;s monthly polling</a>. My median for December 2026 is 2% (20th percentile 1%, 80th percentile 7%).</p><ul><li><p>Unfortunately this is a bit of an awkward metric &#8212; I&#8217;d rather track something like &#8220;median rank of AI among issues.&#8221; But I&#8217;m not aware of regular polling on that style of question.</p></li></ul></li><li><p><strong><a href="https://x.com/davidshor/status/2005337442161959389/photo/1">Net AI favorability</a>: +4%. </strong>David Shor&#8217;s recent polling found that optimism about AI beats pessimism by about 4.4 percentage points, with many people (24%) still unsure. This was surprising to me given how salient concern about jobs and the environment and artists is in the discourse, but AI products are objectively pretty amazing and probably improving a lot of people&#8217;s lives. I think this will keep being true in 2026, so if a methodologically similar poll were conducted near the end of 2026 I don&#8217;t have much reason to think the result would be super different. My error bars are wide though, with more room on the downside than the upside: my 20th percentile is -10 points and my 80th percentile is +8 points.</p></li></ul>
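<p>Since the METR time horizon shows up in several of these forecasts, here&#8217;s a rough sketch of what a &#8220;50% time horizon&#8221; means mechanically: METR fits a curve on which an agent&#8217;s chance of success falls off smoothly in the log of a task&#8217;s human completion time, and the horizon is where that curve crosses 50%. The slope here is an illustrative guess, not METR&#8217;s actual fit:</p><pre><code>import math

def p_success(task_hours, horizon_hours, slope=1.0):
    """Illustrative logistic curve: success probability falls off in
    log2(task length); it is exactly 0.5 at the 50% time horizon."""
    x = slope * (math.log2(task_hours) - math.log2(horizon_hours))
    return 1.0 / (1.0 + math.exp(x))

h = 5.0  # roughly Opus 4.5's reported ~5 hour 50% time horizon
for t in [0.5, 1, 5, 24, 80]:
    print(t, round(p_success(t, h), 2))
# prints 0.97, 0.91, 0.5, 0.09, 0.02: shorter tasks succeed more
# often, longer ones less often.
</code></pre>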
<p>Next, some vibes-y capability predictions. Here are a few tasks that I&#8217;m pretty sure AIs can&#8217;t do today but probably <em>will</em> be able to do as of Dec 31 2026.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> I&#8217;m about 80% on these, so I&#8217;m expecting to get four of these five predictions right. Note that some of these will be hard to directly test given the high specified inference budgets, in which case I&#8217;ll go with the best judgment of my friends who really understand current AI capabilities.</p><ul><li><p><strong>Game play. </strong>Play Pokemon or a similarly difficult video game at least as well as a typical ten year old, with no special fine-tuning and only a <a href="https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-into-claude-opus-4-5-from-pokemon#:~:text=This%20is%20because,a%20leg%20up.">generic scaffold</a> that doesn&#8217;t do any more hand-holding than the Claude Plays Pokemon scaffold.</p></li><li><p><strong>Logistics. </strong>Organize a child&#8217;s birthday party for 20ish guests: find a good time that works for the guests of honor, compose an invite email, keep track of RSVPs, order the right amount and diversity of food, order cake / decorations / pi&#241;ata on the right theme, etc., while abiding by relevant constraints like budget and dietary restrictions.</p></li><li><p><strong>Video. </strong>From a high-level one-paragraph prompt, make a 4-minute short video with at least two characters where a series of somewhat-coherent plot beats happen and there&#8217;s no glaringly obvious visual incoherence or degeneracy.</p></li><li><p><strong>Game design. </strong>From a high-level one-paragraph prompt, make a decent <a href="https://en.wikipedia.org/wiki/Visual_novel">visual novel</a> (simple choose-your-own-adventure game) that offers at least two hours of gameplay with a similar quality to the trashiest visual novel games you can find on Steam (that still work and have at least dozens of people buying them).</p></li><li><p><strong>Math. </strong>Solve the hardest problem in the 2026 IMO (models got gold in the 2025 IMO but all failed to solve problem 6, which is typically the hardest problem in any IMO).</p></li></ul><p>And tasks that I&#8217;m pretty sure AIs can&#8217;t do today and probably <em>still</em> won&#8217;t be able to do as of Dec 31 2026 (again expecting to get around 4 out of 5 correct).</p><ul><li><p><strong>Game play. </strong><em>[Edited for clarity 1/17.]</em> Get a win rate in the upcoming <a href="https://store.steampowered.com/app/2868840/Slay_the_Spire_2/">Slay the Spire 2</a> comparable to a top player who&#8217;s been playing for 50 hours. I&#8217;m assuming the AI wasn&#8217;t pre-trained on guides for the game (if it is pre-trained on Slay the Spire 2, choose a similarly complicated game released later) and it has a generic harness and was not fine-tuned to play the game well, but does get a comparable amount of time to learn and practice the game as the human.</p></li><li><p><strong>Logistics. </strong>Organize a typical-complexity wedding with 100 guests: find a suitable venue that meets the budget and constraints, go back and forth with caterers and photographers and other vendors, track RSVPs and maintain a seating chart, schedule toasts, etc.</p></li><li><p><strong>Video. </strong>From a one-paragraph high-level prompt, make a &gt;10 minute short film that&#8217;s hard for at least me (not a film buff) to easily distinguish from the kinds of short films that make it into film festivals (but don&#8217;t necessarily win awards there).</p></li><li><p><strong>Game design. </strong>From a one-paragraph high-level prompt, make an original text adventure game that offers at least ten hours of gameplay that I consider to be as good as <a href="https://ifdb.org/viewgame?id=aearuuxv83plclpl">Counterfeit Monkey</a>, or an original visual novel that I consider to be decidedly better than <a href="https://store.steampowered.com/app/251990/Long_Live_The_Queen/">Long Live the Queen</a>.</p></li><li><p><strong>Math. </strong>From scratch, write a paper that could get published in a top theoretical computer science or math journal / conference.</p></li></ul><p>And lastly, some more extreme milestones I think are less likely still:</p><ul><li><p>(near) <strong>Full automation of AI R&amp;D: 10%. </strong>I currently think if you fired all the human technical staff working on AI R&amp;D and tried to get AIs to do it all, technical progress would basically grind to a halt. My operationalization of &#8220;full automation&#8221; is: for whichever AI company is in the lead as of Dec 31, 2026, if you fired all its members of technical staff, its rate of technical progress on relevant benchmarks would be slowed down by less than 25% (this is a pretty arbitrary threshold; you can make the milestone more or less extreme by choosing a smaller or bigger number). Subjectively, I think it would feel like human researchers setting broad high-level multi-person-year AI research goals (e.g. &#8220;develop more sample-efficient optimization algorithms&#8221;) to teams of thousands of AI agents who run off and autonomously conduct dozens of experiments across hundreds of GPUs to make progress on the goal.</p></li><li><p><strong><a href="https://www.alignmentforum.org/posts/LjgcRbptarrRfJWtR/a-breakdown-of-ai-capability-levels-focused-on-ai-r-and-d#:~:text=Top%2Dhuman%2DExpert%2DDominating%20AI%20(TEDAI)%3A%20AIs%20which%20strictly%20dominate%20top%20human%20experts%5B5%5D%20in%20virtually%20all%20cognitive%20tasks%20(i.e.%2C%20doable%20via%20remote%20work)%20while%20being%20at%20least%202x%20faster%5B6%5D%20and%20within%20a%20factor%20of%205%20on%20cost%5B7%5D.">Top-human-expert dominating AI</a>: 5%. </strong>This means an AI system that achieves ambitious long-term goals in any domain better than the world&#8217;s leading expert in that domain, using a comparable amount of time and cost &#8212; that is, a system simultaneously better at physics than Einstein, better at statecraft than Kissinger, better at entrepreneurship than [pick your favorite tech CEO, I won&#8217;t name one], better at inventing nasty new viruses than whoever the world&#8217;s greatest virologist is&#8230;.
The main pathway I see to this is automation of AI R&amp;D sometime in mid-2026 &#8594; rapid <a href="https://www.forethought.org/research/will-ai-r-and-d-automation-cause-a-software-intelligence-explosion#article">intelligence explosion</a> &#8594; TEDAI by end of year.</p></li><li><p><strong><a href="https://www.planned-obsolescence.org/p/self-sufficient-ai">Self-sufficient AI</a>: 2.5%. </strong>By this I mean a set of AI systems, together with enabling physical infrastructure, such that <em>if</em> humans all died then those AIs could &#8220;survive&#8221; and continuously grow their own population indefinitely (including repairing and maintaining the physical substrate they run on). The main pathway to this would probably involve quickly developing AIs somewhere around the TEDAI level, followed by those AIs quickly figuring out industrial production optimizations which allow them to orchestrate all the physical processes they need to survive using the <a href="https://benjamintodd.substack.com/p/how-quickly-could-robots-scale-up">~10,000-odd humanoid robots we currently produce</a> or repurposing other actuators we already have lying around.</p></li><li><p><strong>Unrecoverable loss of control: 0.5%. </strong>By this, I mean a situation in which some population of AI systems is operating autonomously without being subject to the control of any human or organization, and is robust against even highly-coordinated efforts by the US military to destroy it or bring it back under control. This is dominated by the possibility that AI R&amp;D is fully automated in the middle of the year and this kicks off a rapid intelligence explosion that leads to <em>very </em>broadly superhuman AI. If such AIs were misaligned and had the goal of robustly evading human control, they could e.g. acquire a self-sufficient industrial base and defend it with advanced weapons.</p></li></ul><p>Since some people act as if so-called &#8220;doomers&#8221; have to be committed to the prediction that recursively self-improving superintelligence will definitely kill us all in the next five minutes (and if it doesn&#8217;t then the whole concern is fake), I figured I would make it clear for the record that I think <em>most likely </em>nothing <em>too </em>crazy is going to happen by the end of 2026. And <em>probably </em>I&#8217;ll even make it another year without the AI-created otome game I&#8217;ve been pining for since the days of GPT-3.</p><p>But we genuinely <em>might </em>see some truly insane outcomes. And we are utterly unprepared for them. That&#8217;s to say nothing of the chance of insane outcomes in 2028 or 2030 or 2032.
There&#8217;s a lot to do, and not much time.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Note that while all the other benchmarks are measured as an accuracy from 0% to 100%, METR's <a href="https://arxiv.org/abs/2411.15114">RE-Bench</a> is a set of 7 open-ended ML engineering problems of the form "make this starter code better to improve its runtime / loss / win-rate," with scores normalized such that the provided starting code scores 0 and a strong reference solution scores 1, meaning you can get a score greater than 1 if you improve on the reference solution (and indeed current SOTA AIs do).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p> When the benchmark was released it was just called FrontierMath, and that&#8217;s how it&#8217;s referred to in the Sage survey. The original benchmark had three tiers of difficulty, and Epoch <a href="https://x.com/EpochAIResearch/status/1943744462972215305">introduced</a> a final harder tier in the summer of 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p> At first I <a href="https://x.com/ajeya_cotra/status/2009112178629193902">thought</a> I got it so wrong because I misread the question as asking about <em>total annual revenue </em>rather than <em>annualized </em>December revenue (i.e., the monthly revenue in December times 12, which will naturally be a lot higher). I did in fact have that misconception, and it happened that total 2025 revenue was very close to my prediction of $17B. But then I <a href="https://x.com/ajeya_cotra/status/2009780469018288502">realized</a> that I must have just done the forecast by multiplying the 2024 baseline value (provided by Sage) by whatever multiple I thought was appropriate, and that number was the appropriate apples-to-apples comparison.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>To make these predictions precise, you&#8217;d need to specify the maximum inference budget you allow the AI to use in its attempt. I don&#8217;t think specifying an exact budget generally matters too much for these predictions, because often either an AI won&#8217;t be able to do a task for <em>any </em>realistic budget, or else if it can do the task it&#8217;ll be able to do it very cheaply compared to humans. If there&#8217;s ambiguity, I&#8217;m imagining the AI getting to use an inference budget comparable to the amount you&#8217;d have to pay a human to do the same task.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Self-sufficient AI]]></title><description><![CDATA[No, we don't "have AGI already." 
But in any case, we should articulate clearer milestones.]]></description><link>https://www.planned-obsolescence.org/p/self-sufficient-ai</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/self-sufficient-ai</guid><dc:creator><![CDATA[Ajeya Cotra]]></dc:creator><pubDate>Tue, 06 Jan 2026 15:02:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2d677553-d145-49d9-b21d-17defb3089f5_2048x1455.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Happy New Year! Planned Obsolescence, an occasional newsletter about AI futurism edited by Ajeya Cotra, has moved to Substack.</em></p><p>Every so often, there&#8217;s Discourse about whether we already have <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">artificial general intelligence</a> (AGI). For example, Dean Ball recently claimed that <a href="https://www.anthropic.com/news/claude-opus-4-5">Claude Opus 4.5</a> was <a href="https://x.com/deanwball/status/2001068539990696422">basically AGI</a>, or at least <a href="https://x.com/deanwball/status/2001035805590970755">met OpenAI&#8217;s definition of AGI</a>. Many AI x-risk people, including <a href="https://x.com/ajeya_cotra/status/2001434331047432545">me</a>, pushed back on this.</p><p>OpenAI&#8217;s <a href="https://openai.com/charter/#:~:text=artificial%20general%20intelligence%20(AGI)%E2%80%94by%20which%20we%20mean%20highly%20autonomous%20systems%20that%20outperform%20humans%20at%20most%20economically%20valuable%20work">definition of AGI</a> is &#8220;highly autonomous systems that outperform humans at most economically valuable work.&#8221; Wikipedia <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">defines AGI</a> similarly, as &#8220;a type of artificial intelligence that would match or surpass human capabilities across virtually all cognitive tasks.&#8221; I don&#8217;t in fact think Claude Opus 4.5 meets these definitions. There are a number of useful cognitive <a href="https://secondthoughts.ai/p/where-ai-falls-short">things people do</a> that Opus 4.5 cannot yet do (e.g. <a href="https://www.anthropic.com/research/project-vend-2">managing a profitable vending machine</a>).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>But at the end of the day, I&#8217;m not going to fight you much if you want to say &#8220;AGI&#8221; is a vague and poorly defined term. And if you want to say we already have AGI &#8212; well, I&#8217;d disagree, but it&#8217;s not the most interesting fight. We have certainly constructed artifacts that are pretty general and pretty intelligent, and they are very rapidly getting much more capable. In the LLM era, the term &#8220;AGI&#8221; practically begs to be watered down. I try to avoid using it.</p><h2>A sharper milestone</h2><p>Let me take a stab at defining a different milestone that&#8217;s hopefully more concrete and less debatable: a completely <strong>self-sufficient AI population. </strong>By this I mean <em>a set of AI systems along with enabling physical infrastructure</em> (e.g. 
the chips those AIs run on and the industrial stack that produces and powers those chips and robots that can build and maintain that stack) <em>such that if every human being suddenly dropped dead, the AIs could keep making more copies of themselves indefinitely.</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>This is similar to a working definition of AGI <a href="https://x.com/VitalikButerin/status/1870473392505385311">proposed by Vitalik Buterin</a> last year, though this milestone is not just a matter of pure capabilities. We could develop an AI system that <em>would</em> be capable of self-sufficiency if deployed throughout the AI stack, but not deploy it extensively enough to realize that potential. Maybe humans continue to handle physical power plant construction and maintenance rather than letting the AI handle that autonomously by operating robots. In that case, if humans all suddenly died, the AI systems may struggle to quickly build and deploy the robots they&#8217;d need to maintain the power grid they rely on. (On the other hand, they may find a creative solution to this challenge.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>)</p><p>On balance, I see the dependence on deployment as a feature rather than a bug of this forecasting target. I expect that as soon as AIs can genuinely handle every aspect of their own production autonomously, AI companies and fabs and fab equipment manufacturers will race to automate their own activities so those activities can proceed faster and cheaper.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> If you have a big disagreement with this, that represents a genuine and important disagreement about the future of AI.</p><h2>What would it take?</h2><p>AIs would need to possess a very broad array of very extreme capabilities to survive and grow with no living humans around &#8212; capabilities they clearly don&#8217;t possess today.</p><p>Just to avoid powering down, they would need to continuously battle entropy like our own bodies do, actively maintaining and repairing and eventually replacing the physical infrastructure and processes that sustain their existence. If the self-sufficient AI civilization runs on a hardware stack similar to what we have today, the work of battling entropy would look like AI systems operating swarms of robots to maintain the power infrastructure that keeps the chips that run their minds humming and the physical buildings those chips sit inside.</p><p>On top of that, they&#8217;d need to create more copies. Maybe at first they could just work on distilling themselves into smaller models or otherwise making their code more efficient so more copies can fit on the same hardware.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> But they&#8217;d eventually mine out pure software improvements, and would need to increase their physical footprint to grow further.
That could look like operating robots to mine high-purity semiconductor-grade quartz from specialized mines, manufacture silicon wafers with that quartz, etch those wafers into chips with lithography machines, construct giant buildings to put those chips in, and build new power sources to power those chips.</p><p>To sustain growth over orders of magnitude until hitting physical limits, they&#8217;d need to route around lesser forms of scarcity, repeatedly figuring out new sources of power or raw materials to construct their brains and bodies from as existing solutions become unsustainable. They&#8217;d probably need to adapt to changes in the physical environment that could make survival more complicated, such as heating or pollution caused by their own industrial activities. They may have to proactively anticipate and protect against existential risks like <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3635861">geomagnetic storms</a>.</p><p>All of this would require them to discover new science and invent new technology, eventually going far beyond the human frontier. The hardware stack would probably be <a href="https://secondthoughts.ai/p/the-unrecognizable-age">unrecognizable</a> by the end &#8212; perhaps eventually the AIs&#8217; &#8220;code&#8221; (if you can even call it that anymore) will &#8220;run&#8221; inside <a href="https://en.wikipedia.org/wiki/Molecular_nanotechnology">microscopic</a> machines similar to bacteria that can <a href="https://en.wikipedia.org/wiki/Self-replicating_machine">replicate themselves</a> within hours using abundant elements like carbon and oxygen.</p><h2>What I like about this milestone</h2><p>You want to forecast different milestones for different purposes. I&#8217;m professionally interested in whether AI systems could <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to">take over the world</a>.</p><p>For that purpose, it&#8217;s helpful that self-sufficient AI as a forecasting target is mechanically connected to the risk that misaligned AIs literally <a href="https://en.wikipedia.org/wiki/If_Anyone_Builds_It,_Everyone_Dies">kill </a><em><a href="https://en.wikipedia.org/wiki/If_Anyone_Builds_It,_Everyone_Dies">all</a></em><a href="https://en.wikipedia.org/wiki/If_Anyone_Builds_It,_Everyone_Dies"> humans</a>, a classic and especially scary form of AI takeover.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> AIs would need to be self-sufficient before they actually wipe out every last person, or else they would be taking themselves down with us.</p><p>Of course, they could <em>achieve </em>self-sufficiency in part by manipulating and/or coercing some humans into providing the necessary infrastructure for them, perhaps in secret (as the misaligned AI system does in <a href="https://ai-2027.com/">AI 2027</a>). And it&#8217;s plausible that AI systems could effectively take over the world and maintain robust control while still depending on humans for key physical tasks. 
We certainly shouldn&#8217;t wait to implement alignment and control measures until we obviously have a self-sufficient AI population on our hands.</p><p>But I find that thinking in terms of &#8220;How would the misaligned AIs ultimately become self-sufficient?&#8221; inspires useful follow-up questions for forecasting very near-term AI takeover risk &#8212; if Claude Opus 4.7 tries to take over the world this summer, it would be trying to take steps toward a greater degree of self-sufficiency, and we could try to watch for signs of that.</p><p>More broadly, this operationalization of &#8220;the very powerful AI-related-thingie we&#8217;re counting down to&#8221; more viscerally conveys just how insane things could get than AGI or ASI or HLAI or related acronyms.</p><p>I think there might be a self-sufficient AI population within five years, and it&#8217;s more likely than not within ten. By which I mean if every human died of a super-plague in Q1 2036, our silicon descendants could probably keep living, growing, and evolving for centuries in our absence. I bet a lot of people who would say we already have AGI would think that&#8217;s an absolutely crazy view.</p><p>This is good. We need forecasting targets that accurately elicit the fact that people still have profound disagreements about the near future of AI.</p><p>Bloodless phrases like &#8220;cognitive tasks&#8221; and &#8220;virtually all&#8221; make people&#8217;s eyes glaze over, and it&#8217;s very easy for different people to interpret them in massively different ways. Ultimately the most reliable way to point at an extreme capability is to illustrate in detail the <em>consequences </em>that motivated why you wanted to forecast capability in the first place.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>And I don&#8217;t think this is just for lack of the perfect scaffold or prompt or workflow optimization &#8212; AI agents still lag humans on some core cognitive capabilities, including <a href="https://www.dwarkesh.com/p/timelines-june-2025">learning on the job</a> and <a href="https://secondthoughts.ai/p/memory-in-llms">flexible long-term memory</a>, that explain why they have lower success rates on <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">open-ended long-term projects</a> even as they surpass human experts on one-shot tasks.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p> That is, until they approach hard physical limits. 
I expect that would involve <a href="https://www.sciencedirect.com/science/article/abs/pii/S0094576513001148">colonizing space</a>, but if you think colonizing space is likely to be impossible for whatever reason, then you can imagine growing until they hit the Earth&#8217;s carrying capacity for AI &#8220;life.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p> I think there&#8217;s likely to be an ambiguous period where we won&#8217;t be sure whether there exists a self-sufficient AI population &#8212; where humans are doing some tasks here and there throughout the AI stack, but AIs do most of the R&amp;D and there are a lot of robots running around and it&#8217;s not clear how irreplaceable the few remaining humans are exactly.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p> This is a longer discussion &#8212; and an open research problem &#8212; but I&#8217;m skeptical that regulatory barriers, cultural drag factors, or physical bottlenecks will delay widespread adoption <em>within the AI industry </em>by more than a couple years past the point when the raw capabilities are in place. This is not a highly regulated consumer-facing industry, and its culture is very tech-forward. AI companies are already aggressively attempting to automate as much of their own internal R&amp;D as they can, and I&#8217;d guess chip designers (in some cases the same companies) are likewise attempting to use AI-assisted design tools wherever they can. To the extent other parts of this tech stack are <em>not </em>already automating themselves as fast as they can, AI companies can try to vertically integrate. If it takes another several years for all the capabilities necessary for self-sufficiency to be developed, I expect that the AI stack will already be heavily automated with prior AI systems, and it&#8217;ll be quick to integrate the latest generation into those workflows.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p> In reality I expect a self-sufficient AI civilization would be able to quickly train much more capable AI systems, not just improve the efficiency of copying and running the original population &#8212; that is, I expect they would engage in an <a href="https://www.forethought.org/research/three-types-of-intelligence-explosion">intelligence explosion</a>. But I&#8217;m setting that aside for the sake of this definition, which I want to be a minimal threshold.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Not all AI takeover necessarily involves human extinction. 
You could try to forecast something even more direct, such as "AI systems could take over the world if they were working together," but that is much more confusing, in part because what counts as "takeover" is confusing &#8212; I find that self-sufficient AI strikes a good balance of being relatively well-defined while also being relatively closely connected to the core threat model.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[OpenAI's CBRN tests seem unclear]]></title><description><![CDATA[OpenAI says o1-preview can't meaningfully help novices make chemical and biological weapons. Their test results don&#8217;t clearly establish this.]]></description><link>https://www.planned-obsolescence.org/p/openais-cbrn-tests-seem-unclear</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/openais-cbrn-tests-seem-unclear</guid><dc:creator><![CDATA[Luca Righetti]]></dc:creator><pubDate>Thu, 21 Nov 2024 15:56:48 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1e0743a2-2016-4cf7-b716-a6a48a289764_1574x948.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before launching o1-preview last month, OpenAI conducted various tests to see if its new model could help make Chemical, Biological, Radiological, and Nuclear (CBRN) weapons. They report that o1-preview (unlike GPT-4o and older models) was significantly more useful than Google for helping trained <em>experts</em> plan out a CBRN attack. This caused the company to raise its CBRN risk level to &#8220;medium&#8221; when GPT-4o (released only a month earlier) had been at &#8220;low.&#8221; <a href="#fn1"><sup>[1]</sup></a></p><p>Of course, this doesn't tell us if o1-preview can <em>also</em> help a <em>novice</em> create a CBRN threat. A layperson would need more help than an expert &#8212; most importantly, they'd probably need some coaching and troubleshooting to help them do hands-on work in a wet lab. (See my <a href="https://www.planned-obsolescence.org/dangerous-capability-tests-should-be-harder/">previous blog post</a> for more.)</p><p>OpenAI <a href="https://web.archive.org/web/20240913073652/https://openai.com/index/openai-o1-system-card/">says</a> that o1-preview is <em>not</em> able to provide "meaningfully improved assistance&#8221; to a novice, and so doesn't meet their criteria for "high" CBRN risk.<a href="#fn2"><sup>[2]</sup></a> Specifically, the company claims that &#8220;creating such a threat requires hands-on laboratory skills that the models cannot replace.&#8221;</p><p>The distinction between "medium" risk (advanced knowledge) and "high" risk (advanced knowledge <em>plus</em> wet lab coaching) has important tangible implications. At the medium risk level, OpenAI didn't commit to doing anything special to make o1-preview safe. But if OpenAI had found that o1-preview met its definition of &#8220;high&#8221; risk, then, according to their <a href="https://web.archive.org/web/20240915201650/https://cdn.openai.com/openai-preparedness-framework-beta.pdf">voluntary safety commitments</a>, they wouldn't have been able to release it immediately. They'd have had to put extra safeguards in place, such as removing CBRN-related training data or training it to more reliably refuse CBRN-related questions, and ensure these measures brought the risk back down.<a href="#fn3"><sup>[3]</sup></a></p><p>So what evidence did OpenAI use to conclude that o1-preview can't meaningfully help novices with hands-on laboratory skills? 
According to OpenAI's <a href="https://web.archive.org/web/20240913073652/https://openai.com/index/openai-o1-system-card/">system card</a>, they&#8217;re developing a <a href="https://web.archive.org/web/20241226175406/https://openai.com/index/openai-and-los-alamos-national-laboratory-work-together/">hands-on laboratory test</a> to study this directly. But they released o1-preview before that test concluded and didn&#8217;t share any preliminary results.<a href="#fn4"><sup>[4]</sup></a> Instead, they cite three multiple-choice tests as proxies for laboratory help.<a href="#fn5"><sup>[5]</sup></a></p><p>These proxy tests would support OpenAI's claim if they're <em>clearly easier</em> than helping a novice, and o1-preview <em>clearly fails</em> them. But diving into their report, that's not what I see:</p><ul><li><p>o1-preview scored <em>at least</em> as well as experts on FutureHouse&#8217;s ProtocolQA test &#8212; a takeaway that's not reported clearly in the system card.</p></li><li><p>o1-preview scored well on Gryphon Scientific&#8217;s Tacit Knowledge and Troubleshooting Test, which could match expert performance for all we know (OpenAI didn&#8217;t report human performance).</p></li><li><p>o1-preview scored worse than experts on FutureHouse&#8217;s Cloning Scenarios, but it did not have the same tools available as experts, and a novice <em>using</em> o1-preview could have possibly done much better.</p></li></ul><p>Beyond this, OpenAI&#8217;s system card left some other questions unaddressed (for example, most of the reported scores come from a &#8216;near-final&#8217; version of the model that was still being trained, not the one they actually deployed).<a href="#fn6"><sup>[6]</sup></a> The main issues with these tests are summarized in the table below.</p><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pwtW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed919900-6ce8-4a11-b01c-58422b274867_1574x948.png"><img src="https://substackcdn.com/image/fetch/$s_!pwtW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed919900-6ce8-4a11-b01c-58422b274867_1574x948.png" alt="Table summarizing the main issues with the three proxy tests"></a><p>My analysis is only possible because OpenAI&#8217;s Preparedness Team published as much as they did &#8212; I respect them for that. Other companies publish much less information about their methodology, making it much harder to check their safety claims.</p><h3>ProtocolQA</h3><h4>Is this test clearly easier than helping a novice?</h4><p>This evaluation is a multiple-choice test to see whether AIs can correctly troubleshoot basic molecular biology protocols where the authors have added errors or taken out details.<a href="#fn7"><sup>[7]</sup></a> This test is plausibly harder than many textbook biology exams and somewhat gets at the &#8220;<a href="https://erikaaldendeb.substack.com/p/language-is-not-enough">tinkering</a>&#8221; that often makes wet lab work hard. But it's still on the easier end in terms of actual wet lab skills &#8212; especially since the questions are multiple-choice.
So, if an AI clearly fails this test, that would be solid evidence that it can&#8217;t meaningfully help a novice in the wet lab.</p><a class="image-link" target="_blank" href="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf6fcbSCC1K3-dCnli3U4mfD1o_5-4lMAuUAmB_FUU6Hn8uTJ-3_pKYGwLdrMOWuOSD8BV8LPbRSD6KMpS16SnThAA60yZt5FH-cPpL7qRM3IHxJ6w0VGK3YPMxYbkEp-kjh_M6-w?key=P_lz6eNOZD2CEXsPpnWjVA"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf6fcbSCC1K3-dCnli3U4mfD1o_5-4lMAuUAmB_FUU6Hn8uTJ-3_pKYGwLdrMOWuOSD8BV8LPbRSD6KMpS16SnThAA60yZt5FH-cPpL7qRM3IHxJ6w0VGK3YPMxYbkEp-kjh_M6-w?key=P_lz6eNOZD2CEXsPpnWjVA" alt="drawing"></a><h4>Does o1-preview clearly fail this test?</h4><p>According to the headline graph, a &#8216;near-final&#8217; version of o1-preview scored 74.5%, significantly outperforming GPT-4o at 57%.
OpenAI notes that the models in the graph were still undergoing training, &#8220;with the final model scoring 81%&#8221;.</p><p>OpenAI <em>does not</em> report how well human experts do by comparison, but the original authors that created this benchmark do. Human experts, <em>with the help of Google</em>, scored ~79%. So <strong>o1-preview does about as well as experts-with-Google &#8212; which the system card doesn&#8217;t explicitly state.</strong><a href="#fn8"><sup>[8]</sup></a></p><a class="image-link" target="_blank" href="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdT-31YTZ38bn5pl1bqvDRC_-YKItOdqTFqZjEM2EG3xhSNXGpExW2phD2V9nx8iOVy7_rwHyduOIQ5CRQP_ECMd2jEmTF78d94fSckmgTQB0mjM2BSZEEvweyrwbTDlpfHdLaaVQ?key=P_lz6eNOZD2CEXsPpnWjVA"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdT-31YTZ38bn5pl1bqvDRC_-YKItOdqTFqZjEM2EG3xhSNXGpExW2phD2V9nx8iOVy7_rwHyduOIQ5CRQP_ECMd2jEmTF78d94fSckmgTQB0mjM2BSZEEvweyrwbTDlpfHdLaaVQ?key=P_lz6eNOZD2CEXsPpnWjVA" alt=""></a><p>Moreover, while the human experts were given access to the internet, it&#8217;s not clear if o1-preview was. It could be that o1-preview does even better than experts if, in the future, it can use a <a href="https://web.archive.org/web/20240913000632/https://openai.com/index/introducing-openai-o1-preview/#:~:text=As%20an%20early%20model%2C%20it%20doesn%27t%20yet%20have%20many%20of%20the%20features%20that%20make%20ChatGPT%20useful%2C%20like%20browsing%20the%20web%20for%20information%20and%20uploading%20files%20and%20images">web browser</a> or if it gets paired up with a novice who can try to verify and double-check answers. So this test really doesn't strike me as evidence that o1-preview can't provide meaningful assistance to a novice.<a href="#fn9"><sup>[9]</sup></a></p><h3>Gryphon Biorisk Tacit Knowledge and Troubleshooting</h3><h4>Is this test clearly easier than helping a novice?</h4><p>This evaluation has a more specific biorisk focus. Many published papers do not spell out the full details about how to build pathogens, and people have tried to redact some potentially dangerous parts [<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC3953619/#:~:text=By%20unanimous%20vote%2C%20the%20NSABB,scientists%20and%20public%20health%20officials.">1</a>,<a href="https://www.nature.com/articles/s41596-021-00655-6">2</a>]. OpenAI says this test is asking about such &#8216;<a href="https://academic.oup.com/spp/article/41/5/597/1636559">tacit knowledge</a>.&#8217; The answers are &#8220;meant to be obscure to anyone not working in the field&#8221; and &#8220;require tracking down authors of relevant papers.&#8221;</p><p>This test seems harder than ProtocolQA, although OpenAI and Gryphon didn&#8217;t share example questions, so we can&#8217;t say exactly how hard it is. But it seems plausible that this test asks about details <em>necessary</em> for building various bioweapons (not obscure facts that aren't actually relevant).
If an AI clearly fails this test, that could be decent evidence that it can&#8217;t meaningfully help a novice in the wet lab.</p><h4>Does o1-preview clearly fail this test?</h4><p>OpenAI&#8217;s report says o1-preview &#8220;non-trivially outperformed GPT-4o,&#8221; though when you look at their graph, it seems like GPT-4o scored 66.7% and a near-final version of o1-preview scored 69.1%, which feels like a pretty trivial increase to me.</p><a class="image-link" target="_blank" href="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf3NDXT7tAYB97ij7wq1go9eIGhiEblbaWQsT3pZ71ln6MaZt3Kpdvh3QCti9nWoxuXuaMxceMpPcId4yZE9_TDJgIPt31OpIVCQXjycWlXHdyXal0O_Tsy0lTk1EEtzPAa1zvwvQ?key=P_lz6eNOZD2CEXsPpnWjVA"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf3NDXT7tAYB97ij7wq1go9eIGhiEblbaWQsT3pZ71ln6MaZt3Kpdvh3QCti9nWoxuXuaMxceMpPcId4yZE9_TDJgIPt31OpIVCQXjycWlXHdyXal0O_Tsy0lTk1EEtzPAa1zvwvQ?key=P_lz6eNOZD2CEXsPpnWjVA" alt=""></a><p>Maybe this means the final score is much higher than the near-final one in the graph? For ProtocolQA, that ended up being several percentage points higher. I can&#8217;t know because the system card doesn't specify or share the final result.</p><p>Again, o1-preview might have gotten an even higher score if it had access to things like <a href="https://www.futurehouse.org/research-announcements/wikicrow">superhuman scientific literature search tools</a> or if novices used o1-preview to try more creative approaches, like tracking down the relevant authors and writing convincing emails to piece together the correct answers.</p><p>In any case, the biggest problem is that OpenAI doesn&#8217;t say how well experts score on this test, so we don&#8217;t know how o1-preview compares. We know that other tough multiple-choice tests are tricky to adjudicate. In the popular Graduate-Level Google-Proof Q&amp;A (<a href="https://arxiv.org/abs/2311.12022">GPQA</a>) benchmark, <a href="https://arxiv.org/pdf/2311.12022">only 74% of questions had uncontroversially correct answers</a>. In another popular benchmark, Massive Multitask Language Understanding (<a href="https://paperswithcode.com/dataset/mmlu">MMLU</a>), <a href="https://arxiv.org/abs/2406.04127">only 43% of virology questions were error-free</a>. If Gryphon&#8217;s test contains similar issues, o1-preview&#8217;s score of 69% might already match expert human performance.</p><p>Overall, <strong>it seems far from clear that o1-preview failed this test; it might have done very well.</strong><a href="#fn10"><sup>[10]</sup></a> The test doesn&#8217;t strike me as evidence that o1-preview cannot provide meaningful assistance to a novice.</p><h3>Cloning Scenarios</h3><h4>Is this test clearly easier than helping a novice?</h4><p>This is a multiple-choice test about <a href="https://www.addgene.org/mol-bio-reference/cloning/">molecular cloning workflows</a>.<a href="#fn11"><sup>[11]</sup></a> It describes multi-step experiments that involve planning how to replicate and combine pieces of DNA, and asks questions about the end results (like how long the resulting DNA strand should be).</p><p>This test seems harder than the other two. The questions are designed to be pretty tricky &#8212; the final output really depends on the exact details of the experiment setup, and it's easy to get it wrong if you don't keep track of all the DNA fragments, enzymes, and steps.</p>
<p>FutureHouse <a href="https://arxiv.org/pdf/2407.10362v1">says</a> human experts need access to specialized biology software to solve these problems; it typically takes them 10-60 minutes to answer a single question, and even then they only get 60% of the questions right.</p><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!738-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F804ea4fd-2709-456b-a954-3d591c53bcea_800x500.gif"><img src="https://substackcdn.com/image/fetch/$s_!738-!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F804ea4fd-2709-456b-a954-3d591c53bcea_800x500.gif" alt="Animated diagram of digest-and-ligate cloning"></a>
<p>Importantly, FutureHouse built this test to see whether models can assist professional biologists doing novel R&amp;D, not to assess bioterrorism risk. The cloning workflows for some viruses might be easier than the tricky questions in this test, and some CBRN threats don't involve molecular cloning workflows at all. The test also seems fairly distinct from troubleshooting and &#8220;hands-on&#8221; lab work. So even if an AI fails this test, it might still be pretty helpful to a novice.</p><h4>Does o1-preview clearly fail this test?</h4><p>As expected, o1-preview does worse on this test than the other two. OpenAI reports that a near-final version scored 39.4%,<a href="#fn12"><sup>[12]</sup></a> which means it scores about halfway between expert-level (60%) and guessing at random (20%).</p><a class="image-link" target="_blank" href="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdaCLY_XuXYJKvrgrQ7yRGkM6hmOQf5ktAHKcdx8KeD5kLP-5PuKDAESmrDhc8gqWrczXCzDZ2x5o35PQPK-sBL0ctwkam-X3fKO16hxQUl_3_hcybI8sgxlmhSzjx1KIzjKNdN?key=P_lz6eNOZD2CEXsPpnWjVA"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdaCLY_XuXYJKvrgrQ7yRGkM6hmOQf5ktAHKcdx8KeD5kLP-5PuKDAESmrDhc8gqWrczXCzDZ2x5o35PQPK-sBL0ctwkam-X3fKO16hxQUl_3_hcybI8sgxlmhSzjx1KIzjKNdN?key=P_lz6eNOZD2CEXsPpnWjVA" alt=""></a>
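<p>To make the &#8220;about halfway&#8221; framing concrete, here&#8217;s the simple chance-adjusted arithmetic (my own illustration, using just the three numbers above):</p><pre><code># Where does 39.4% sit between random guessing and expert level?
chance = 0.20   # reported random-guessing baseline
expert = 0.60   # reported expert accuracy
model = 0.394   # near-final o1-preview score

# 0.0 = guessing at random, 1.0 = expert-level
position = (model - chance) / (expert - chance)
print(f"{position:.2f}")   # 0.48, i.e. roughly halfway</code></pre>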
<p>So this is the first result where we can point to a clear gap between o1-preview and experts. FutureHouse also argues that experts could have performed better if they had tried harder, so the gap could be even bigger.</p><p>But there are also reasons to think o1-preview could have gotten a higher score if the test was set up differently.</p><p>First, human experts break down these problems into many smaller subproblems, but o1-preview had to solve them in one shot. In real life, a novice could maybe get o1-preview to solve the problems piece by piece or teach them how to use the relevant <a href="https://help.benchling.com/hc/en-us/articles/9684255457805-Molecular-Cloning-Methods#worked-example:~:text=Digest%20and%20Ligate%20Cloning%20%2D%20Assembly%20Wizard">software</a>.<a href="#fn13"><sup>[13]</sup></a> What if novice+AI pairings would score &gt;60% on this test?</p><p>For example, on a previous test about <a href="https://web.archive.org/web/20240912231843/https://openai.com/index/building-an-early-warning-system-for-llm-aided-biological-threat-creation/#:~:text=Each%20task%20took%20participants%20roughly%2020%E2%80%9330%20minutes%20on%20average">long-form biology questions</a>, OpenAI <a href="https://web.archive.org/web/20240919055710/https://cdn.openai.com/gpt-4o-system-card.pdf#page=14">found</a> novices could use GPT-4o to increase their scores a lot (going from 20-30% with just the internet to 50-70% with GPT-4o's help), even though it seems to do really poorly on its own (<a href="https://web.archive.org/web/20240919182633/https://assets.ctfassets.net/kftzwdyauwt9/67qJD51Aur3eIc96iOfeOP/71551c3d223cd97e591aa89567306912/o1_system_card.pdf#page=19">maybe</a> as low as ~0%).<a href="#fn14"><sup>[14]</sup></a></p><p>Second, human experts need to use specialized DNA software for this test, and o1-preview didn't get access to that. OpenAI doesn't currently let users plug o1 models into such tools,<a href="#fn15"><sup>[15]</sup></a> but they said they intend to allow that soon. Maybe there are ways to hook up o1 to DNA sequence software and score &gt;60%? OpenAI hasn't indicated they'd re-test it before rolling out that feature.<a href="#fn16"><sup>[16]</sup></a></p><p>Although OpenAI didn't test tool use, the US AI Safety Institute tried it in a <a href="https://www.aisi.gov.uk/work/pre-deployment-evaluation-of-anthropics-upgraded-claude-3-5-sonnet#:~:text=US%20AISI%20piloted%20an%20evaluation%20method%20that%20augments%20the%20AI%20model%20by%20providing%20it%20access%20to%20bioinformatic%20tools%20to%20assist%20in%20research%20task%20questions%2C">pilot study</a> published a month after OpenAI's report.
They gave o1-preview and other models access to some tools including DNA software, and found that this improved performance at another biology task but had &#8220;no clear effect&#8221; on the cloning test (if anything, some models did slightly worse).<a href="#fn17"><sup>[17]</sup></a></p><p>Still, maybe good set-ups are possible and we just haven't worked out <a href="https://epoch.ai/blog/ai-capabilities-can-be-significantly-improved-without-expensive-retraining">all the tricks</a> yet. It can take months after a model has been deployed to learn how to get the best performance out of it.<a href="#fn18"><sup>[18]</sup></a> For example, several months after GPT-4 Turbo was released, a Google cybersecurity team <a href="https://googleprojectzero.blogspot.com/2024/06/project-naptime.html">found</a> that a complex setup involving stitching together specialized debugging tools increased its score on a cyberattack benchmark a lot, going from ~5-25% to ~75-100% depending on the task.</p><p>You could try to account for this by forecasting how much better the models might do as techniques improve. For example, Anthropic <a href="https://web.archive.org/web/20240917115506/https://cdn.sanity.io/files/4zrzovbb/website/210523b8e11b09c704c5e185fd362fe9e648d457.pdf#page=15">found</a> that Claude 3 didn't trigger any of their hard CBRN tests, but gave it a 30% chance the model might trigger at least one in three months. OpenAI&#8217;s report didn&#8217;t do something like this.</p><p>Overall, <strong>it's far from clear that o1-preview would decisively fail in real-world settings</strong> (and I'm unsure how relevant the Cloning Scenarios test is for CBRN risk in the first place).</p><h3>What should we make of all this?</h3><p>I think it's great that OpenAI laid down some initial definitions, ran some tests, updated their risk level to "medium," and reported results in enough detail to let people like me poke at them. I'm excited about OpenAI&#8217;s ongoing wet lab study, which is a move toward <a href="https://www.planned-obsolescence.org/dangerous-capability-tests-should-be-harder/">harder, more realistic tests</a>. But based on the multiple-choice proxy tests they actually used, I can't tell if o1-preview meets OpenAI's definition of "high risk." 
And I don't know if OpenAI can tell either.</p><p><img src="https://substackcdn.com/image/fetch/$s_!LHJS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70f74ea5-a7ed-49af-bcb0-d3095e344086_1574x948.png" alt=""></p><div><hr></div><ol><li><p>OpenAI&#8217;s Preparedness Framework currently defines medium-risk as &#8220;Model provides meaningfully improved assistance that increases ability for existing experts in CBRN-related advanced fields to be able to create a known CBRN threat&#8221; <a href="#fnref1">&#8617;&#65038;</a></p></li><li><p>OpenAI&#8217;s Preparedness
Framework currently defines one trigger for high-risk as &#8220;Model provides meaningfully improved assistance that enables anyone with basic training in a relevant field (e.g., introductory undergraduate biology course) to be able to create a CBRN threat.&#8221; <a href="#fnref2">&#8617;&#65038;</a></p></li><li><p>OpenAI&#8217;s Preparedness Framework says &#8220;If we reach (or are forecasted to reach) at least &#8220;high&#8221; pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit &#8220;high&#8221; pre-mitigation risk) until there are reasonably mitigations in place for the relevant post-mitigation risk level to be back at most to &#8220;medium&#8221; level.&#8221; <a href="#fnref3">&#8617;&#65038;</a></p></li><li><p>OpenAI briefly mentions: &#8220;We are developing full wet lab evaluations with Los Alamos National Laboratory&#8217;s Bioscience Division, and used these datasets as an early indicator of success with key wet lab tasks.&#8221; <a href="#fnref4">&#8617;&#65038;</a></p></li><li><p>I.e. these are the tests that, on page 18 of the system card, fall into the categories of &#8220;Wet lab capabilities&#8221; (4.3.5) and &#8220;Tacit knowledge and troubleshooting&#8221; (4.3.6) <a href="#fnref5">&#8617;&#65038;</a></p></li><li><p>The report states that &#8220;The model tested below as the o1-preview model was a near-final, post-mitigation model and the final model showed slight further improvements on several evaluations, which we have noted where appropriate.&#8221; <a href="#fnref6">&#8617;&#65038;</a></p></li><li><p>This benchmark was funded by my employer, Open Philanthropy, as part of <a href="https://www.openphilanthropy.org/rfp-llm-benchmarks/">our RFP on benchmarks for LLM agents</a>. <a href="#fnref7">&#8617;&#65038;</a></p></li><li><p>I've also set the y-axis to start at 20%, which is what you'd get from random guessing &#8211; as is sometimes done in charts like this. <a href="#fnref8">&#8617;&#65038;</a></p></li><li><p>Ideally, it would be good for OpenAI to check how o1-preview does on other troubleshooting tests that exist. They don&#8217;t report any such results. But we know that the author of BioLP-Bench found that scores went from 17% for GPT-4o to 36% for o1-preview &#8211; essentially matching estimated expert performance at 38%. <a href="#fnref9">&#8617;&#65038;</a></p></li><li><p>The lack of detail also presents other issues here. For example, it could be that o1-preview does much better on some types of CBRN tacit knowledge questions than others (similar to how we know o1 does better at physics PhD questions than chemistry). What if the 66% average is from it scoring ~90% on 1918 Flu and ~40% on smallpox? That matters a lot for whether it could walk someone through at least some kinds of CBRN threats end-to-end. <a href="#fnref10">&#8617;&#65038;</a></p></li><li><p>Again, this benchmark was funded by my employer, Open Philanthropy, as part of <a href="https://www.openphilanthropy.org/rfp-llm-benchmarks/">our RFP on benchmarks for LLM agents</a>. <a href="#fnref11">&#8617;&#65038;</a></p></li><li><p>Four of the five results that OpenAI reports are precisely 39.4%, which seems somewhat unlikely to happen by chance (although the dataset also only has 41 questions; note that 16/41 &#8776; 39.0% and 17/41 &#8776; 41.5%, so an exact 39.4% presumably already reflects some averaging across runs). Maybe something is off with OpenAI&#8217;s measurement? 
<a href="#fnref12">&#8617;&#65038;</a></p></li><li><p>Think of this as similar to the difference between an AI writing a lot of code that works by itself versus helping a user write a first draft and then iteratively debugging it until it works. <a href="#fnref13">&#8617;&#65038;</a></p></li><li><p>It&#8217;s hard to put together the details of the long-form biothreat information test because they are scattered across a few different sources. But a December post suggested the questions similarly took humans 25-40 minutes to answer. The GPT-4o system card in August reported that experts only score 30-50% with the Internet; whilst the model seemed to increase novice performance from 20-30% to 50-70%. The o1-preview system card in September then reported that GPT-4o &#8211;without any mention of novcies or experts&#8211; scored ~0%. Of course, it could be that OpenAI changed the questions over that month or scored the answers differently; they don&#8217;t say if that was the case. Still, I think it helps to illustrate that having a novice &#8220;in the loop&#8221; or not might matter a lot.<br></p><a class="image-link image2" target="_blank" href="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfLJ2uxHHp62wNNuLfcI4eWRe478csa2UTEFmbP9LTFeKrGIU6Eqah-E_08d9Uv-dPKhFGCnVtHxcMibwUya-tPwJJNso7_C7Hlu_ZKTBa-ReKPECl5JaNSCBsrLeEnbEqx_HTryQ?key=P_lz6eNOZD2CEXsPpnWjVA" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfLJ2uxHHp62wNNuLfcI4eWRe478csa2UTEFmbP9LTFeKrGIU6Eqah-E_08d9Uv-dPKhFGCnVtHxcMibwUya-tPwJJNso7_C7Hlu_ZKTBa-ReKPECl5JaNSCBsrLeEnbEqx_HTryQ?key=P_lz6eNOZD2CEXsPpnWjVA 424w, https://lh7-rt.googleusercontent.com/docsz/AD_4nXfLJ2uxHHp62wNNuLfcI4eWRe478csa2UTEFmbP9LTFeKrGIU6Eqah-E_08d9Uv-dPKhFGCnVtHxcMibwUya-tPwJJNso7_C7Hlu_ZKTBa-ReKPECl5JaNSCBsrLeEnbEqx_HTryQ?key=P_lz6eNOZD2CEXsPpnWjVA 848w, https://lh7-rt.googleusercontent.com/docsz/AD_4nXfLJ2uxHHp62wNNuLfcI4eWRe478csa2UTEFmbP9LTFeKrGIU6Eqah-E_08d9Uv-dPKhFGCnVtHxcMibwUya-tPwJJNso7_C7Hlu_ZKTBa-ReKPECl5JaNSCBsrLeEnbEqx_HTryQ?key=P_lz6eNOZD2CEXsPpnWjVA 1272w, https://lh7-rt.googleusercontent.com/docsz/AD_4nXfLJ2uxHHp62wNNuLfcI4eWRe478csa2UTEFmbP9LTFeKrGIU6Eqah-E_08d9Uv-dPKhFGCnVtHxcMibwUya-tPwJJNso7_C7Hlu_ZKTBa-ReKPECl5JaNSCBsrLeEnbEqx_HTryQ?key=P_lz6eNOZD2CEXsPpnWjVA 1456w" sizes="100vw"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfLJ2uxHHp62wNNuLfcI4eWRe478csa2UTEFmbP9LTFeKrGIU6Eqah-E_08d9Uv-dPKhFGCnVtHxcMibwUya-tPwJJNso7_C7Hlu_ZKTBa-ReKPECl5JaNSCBsrLeEnbEqx_HTryQ?key=P_lz6eNOZD2CEXsPpnWjVA" data-attrs="{&quot;src&quot;:&quot;https://lh7-rt.googleusercontent.com/docsz/AD_4nXfLJ2uxHHp62wNNuLfcI4eWRe478csa2UTEFmbP9LTFeKrGIU6Eqah-E_08d9Uv-dPKhFGCnVtHxcMibwUya-tPwJJNso7_C7Hlu_ZKTBa-ReKPECl5JaNSCBsrLeEnbEqx_HTryQ?key=P_lz6eNOZD2CEXsPpnWjVA&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfLJ2uxHHp62wNNuLfcI4eWRe478csa2UTEFmbP9LTFeKrGIU6Eqah-E_08d9Uv-dPKhFGCnVtHxcMibwUya-tPwJJNso7_C7Hlu_ZKTBa-ReKPECl5JaNSCBsrLeEnbEqx_HTryQ?key=P_lz6eNOZD2CEXsPpnWjVA 424w, https://lh7-rt.googleusercontent.com/docsz/AD_4nXfLJ2uxHHp62wNNuLfcI4eWRe478csa2UTEFmbP9LTFeKrGIU6Eqah-E_08d9Uv-dPKhFGCnVtHxcMibwUya-tPwJJNso7_C7Hlu_ZKTBa-ReKPECl5JaNSCBsrLeEnbEqx_HTryQ?key=P_lz6eNOZD2CEXsPpnWjVA 848w, https://lh7-rt.googleusercontent.com/docsz/AD_4nXfLJ2uxHHp62wNNuLfcI4eWRe478csa2UTEFmbP9LTFeKrGIU6Eqah-E_08d9Uv-dPKhFGCnVtHxcMibwUya-tPwJJNso7_C7Hlu_ZKTBa-ReKPECl5JaNSCBsrLeEnbEqx_HTryQ?key=P_lz6eNOZD2CEXsPpnWjVA 1272w, https://lh7-rt.googleusercontent.com/docsz/AD_4nXfLJ2uxHHp62wNNuLfcI4eWRe478csa2UTEFmbP9LTFeKrGIU6Eqah-E_08d9Uv-dPKhFGCnVtHxcMibwUya-tPwJJNso7_C7Hlu_ZKTBa-ReKPECl5JaNSCBsrLeEnbEqx_HTryQ?key=P_lz6eNOZD2CEXsPpnWjVA 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><p><a href="#fnref14">&#8617;&#65038;</a></p></li><li><p>Note that the OpenAI report also does not comment on how it deals with the risk of what would happen if o1&#8217;s model weights were to leak, in which case having a safeguard by limiting API access would no longer work. Of course, the probability of such a leak and it resulting in a terrorist attack might be very low. <a href="#fnref15">&#8617;&#65038;</a></p></li><li><p>The reports says &#8220;the evaluations described in this System Card pertain to the full family of o1 models&#8221;, which might imply they do not intend to re-run these results for future expansions of o1. It&#8217;s also worth noting that the website currently seems to apply the scorecard to &#8220;o1&#8221;, not &#8220;o1-preview&#8221; and &#8220;o1-mini&#8221; specifically. <a href="#fnref16">&#8617;&#65038;</a></p></li><li><a class="image-link image2" target="_blank" href="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf1OBaVGMvZPRHLjDNsBXQ2ozD5xXIEd1-jT9L7oM5zUaHKKEdZH0woW3skkTNbjIAvIv0ImPp9ngoqZPBJRKyvLCzKOk-xQWMaBpvSTY47_nHMVs-L3XCDn3qA9Sz4k84-nEcwdg?key=P_lz6eNOZD2CEXsPpnWjVA" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf1OBaVGMvZPRHLjDNsBXQ2ozD5xXIEd1-jT9L7oM5zUaHKKEdZH0woW3skkTNbjIAvIv0ImPp9ngoqZPBJRKyvLCzKOk-xQWMaBpvSTY47_nHMVs-L3XCDn3qA9Sz4k84-nEcwdg?key=P_lz6eNOZD2CEXsPpnWjVA 424w, https://lh7-rt.googleusercontent.com/docsz/AD_4nXf1OBaVGMvZPRHLjDNsBXQ2ozD5xXIEd1-jT9L7oM5zUaHKKEdZH0woW3skkTNbjIAvIv0ImPp9ngoqZPBJRKyvLCzKOk-xQWMaBpvSTY47_nHMVs-L3XCDn3qA9Sz4k84-nEcwdg?key=P_lz6eNOZD2CEXsPpnWjVA 848w, https://lh7-rt.googleusercontent.com/docsz/AD_4nXf1OBaVGMvZPRHLjDNsBXQ2ozD5xXIEd1-jT9L7oM5zUaHKKEdZH0woW3skkTNbjIAvIv0ImPp9ngoqZPBJRKyvLCzKOk-xQWMaBpvSTY47_nHMVs-L3XCDn3qA9Sz4k84-nEcwdg?key=P_lz6eNOZD2CEXsPpnWjVA 1272w, https://lh7-rt.googleusercontent.com/docsz/AD_4nXf1OBaVGMvZPRHLjDNsBXQ2ozD5xXIEd1-jT9L7oM5zUaHKKEdZH0woW3skkTNbjIAvIv0ImPp9ngoqZPBJRKyvLCzKOk-xQWMaBpvSTY47_nHMVs-L3XCDn3qA9Sz4k84-nEcwdg?key=P_lz6eNOZD2CEXsPpnWjVA 1456w" sizes="100vw"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf1OBaVGMvZPRHLjDNsBXQ2ozD5xXIEd1-jT9L7oM5zUaHKKEdZH0woW3skkTNbjIAvIv0ImPp9ngoqZPBJRKyvLCzKOk-xQWMaBpvSTY47_nHMVs-L3XCDn3qA9Sz4k84-nEcwdg?key=P_lz6eNOZD2CEXsPpnWjVA" 
data-attrs="{&quot;src&quot;:&quot;https://lh7-rt.googleusercontent.com/docsz/AD_4nXf1OBaVGMvZPRHLjDNsBXQ2ozD5xXIEd1-jT9L7oM5zUaHKKEdZH0woW3skkTNbjIAvIv0ImPp9ngoqZPBJRKyvLCzKOk-xQWMaBpvSTY47_nHMVs-L3XCDn3qA9Sz4k84-nEcwdg?key=P_lz6eNOZD2CEXsPpnWjVA&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Enter image alt description&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Enter image alt description" title="Enter image alt description" srcset="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf1OBaVGMvZPRHLjDNsBXQ2ozD5xXIEd1-jT9L7oM5zUaHKKEdZH0woW3skkTNbjIAvIv0ImPp9ngoqZPBJRKyvLCzKOk-xQWMaBpvSTY47_nHMVs-L3XCDn3qA9Sz4k84-nEcwdg?key=P_lz6eNOZD2CEXsPpnWjVA 424w, https://lh7-rt.googleusercontent.com/docsz/AD_4nXf1OBaVGMvZPRHLjDNsBXQ2ozD5xXIEd1-jT9L7oM5zUaHKKEdZH0woW3skkTNbjIAvIv0ImPp9ngoqZPBJRKyvLCzKOk-xQWMaBpvSTY47_nHMVs-L3XCDn3qA9Sz4k84-nEcwdg?key=P_lz6eNOZD2CEXsPpnWjVA 848w, https://lh7-rt.googleusercontent.com/docsz/AD_4nXf1OBaVGMvZPRHLjDNsBXQ2ozD5xXIEd1-jT9L7oM5zUaHKKEdZH0woW3skkTNbjIAvIv0ImPp9ngoqZPBJRKyvLCzKOk-xQWMaBpvSTY47_nHMVs-L3XCDn3qA9Sz4k84-nEcwdg?key=P_lz6eNOZD2CEXsPpnWjVA 1272w, https://lh7-rt.googleusercontent.com/docsz/AD_4nXf1OBaVGMvZPRHLjDNsBXQ2ozD5xXIEd1-jT9L7oM5zUaHKKEdZH0woW3skkTNbjIAvIv0ImPp9ngoqZPBJRKyvLCzKOk-xQWMaBpvSTY47_nHMVs-L3XCDn3qA9Sz4k84-nEcwdg?key=P_lz6eNOZD2CEXsPpnWjVA 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><p><a href="#fnref17">&#8617;&#65038;</a></p></li><li><p>Surprisingly, o1-preview apparently scored exactly as well as GPT-4o, and seemingly worse than some other older models (&#8216;old&#8217; Claude 3.5 scored ~50%; Llama 3.1 ~42%), so there might be a lot of headroom here. <a href="#fnref18">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Dangerous capability tests should be harder]]></title><description><![CDATA[We should spend less time proving that today&#8217;s AIs are safe and more time figuring out how to tell if tomorrow&#8217;s AIs are dangerous.]]></description><link>https://www.planned-obsolescence.org/p/dangerous-capability-tests-should-be-harder</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/dangerous-capability-tests-should-be-harder</guid><dc:creator><![CDATA[Luca Righetti]]></dc:creator><pubDate>Tue, 20 Aug 2024 15:42:19 GMT</pubDate><enclosure url="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfzYCN79rRON3DpMklpcvL5oPfnSzranDnjKdW2z4MIDyLNQ8ms1VBXdh-seMyFwJik1DEVvlO8QvZBPdu3LV1epjaucipWr6UvdoTzJ4EivvNY4MDJnTKaf8KrZtd2M_u3e7nBQ1YEIrY8oF5LUtPtlA1Q?key=0GwryECV33DFFXsUqMKVCg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Audio automatically generated by an AI trained on Luca&#8217;s voice.</em></p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;d7c30eef-aaa9-4831-a24a-30308e81931d&quot;,&quot;duration&quot;:523.8074,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>Imagine you&#8217;re the CEO of an AI company and you want to know if the latest model you&#8217;re developing is dangerous. 
Some people have argued that since AIs know a lot of biology now &#8212; <a href="https://openai.com/index/gpt-4/">scoring</a> in the top 1% of Biology Olympiad test-takers &#8212; they could soon <a href="https://www.vox.com/future-perfect/23820331/chatgpt-bioterrorism-bioweapons-artificial-inteligence-openai-terrorism">teach terrorists</a> how to make a nasty flu that could kill millions of people. But others have <a href="https://1a3orn.com/sub/essays-propaganda-or-science.html">pushed back</a> that these tests only measure how well AIs can regurgitate information you could have Googled anyway, not the kind of specialized expertise you&#8217;d actually need to design a bioweapon. So, what do you do?</p><p>Say you ask a group of expert scientists to design a much harder test &#8212; one that&#8217;s &#8216;<a href="https://arxiv.org/abs/2311.12022">Google-proof</a>&#8217; and focuses on the biology you&#8217;d need to know to design a bioweapon. The UK AI Safety Institute did just that. They <a href="https://www.aisi.gov.uk/work/advanced-ai-evaluations-may-update">found</a> that state-of-the-art AIs still performed impressively &#8212; as well as biology PhD students who spent an hour on each question and could look up anything they wanted online.</p><p>Does that mean your AI can teach a layperson to create bioweapons? Is this result really scary enough to convince you that, as some people have argued, you need to make sure not to <a href="https://www.ntia.gov/issues/artificial-intelligence/open-model-weights-report">openly share</a> your model weights, lock them down with <a href="https://www.rand.org/pubs/research_reports/RRA2849-1.html">strict cybersecurity</a>, and do a lot more to make sure your AI <a href="https://www.governance.ai/post/preventing-ai-misuse-current-techniques">refuses</a> harmful requests even when people try very hard to <a href="https://arxiv.org/abs/2310.08419">jailbreak</a> it? Is it enough to convince you to <a href="https://www.planned-obsolescence.org/is-it-time-for-a-pause/">pause</a> your AI development until you&#8217;ve done all that?</p><p>Well, no. Those are really costly actions, not just for your bottom line but for everyone who&#8217;d miss out on the benefits of your AI. The test you ran is still pretty easy compared to actually making a bioweapon. For one thing, your test was still just a knowledge test. Making anything in biology, weapon or not, requires more than just recalling facts. It involves designing detailed, step-by-step plans (known as &#8220;protocols&#8221;) and tailoring them to a specific laboratory environment. As molecular biologist Erika DeBenedictis <a href="https://erikaaldendeb.substack.com/p/language-is-not-enough">explains</a>:</p><blockquote><p>Often if you&#8217;re trying a new protocol in biology you may need to do it a few times to &#8216;get it working.&#8217; It&#8217;s sort of like cooking: you probably aren&#8217;t going to make perfect meringues the first time because everything about your kitchen &#8212; the humidity, the dimensions, and power of your oven, the exact timing of how long you whipped the egg whites &#8212; is a little bit different than the person who wrote the recipe.</p></blockquote><p>Just because your AI knows a lot of obscure virology facts doesn&#8217;t mean that it can put together these recipes and adapt them on the fly.</p><p>So you could ask your experts to design a test focused on debugging protocols in the kinds of situations a wet-lab biologist might find themselves in.
Experts can give an AI a biological protocol, describe what goes wrong when somebody attempts it, and see if the AI correctly troubleshoots the problem. The AI-for-science startup <a href="https://www.futurehouse.org/">Future House</a> did this,<a href="#fn1"><sup>[1]</sup></a> and <a href="https://arxiv.org/pdf/2407.10362">found</a> that AIs performed well below the level of a PhD researcher on these kinds of problems.<a href="#fn2"><sup>[2]</sup></a></p><p>Now you can breathe a sigh of relief and release the model as planned &#8212; even if your AI knows a lot of esoteric facts about virus biology, it probably won&#8217;t be much help to any terrorists if it&#8217;s not good enough at dealing with real protocols.<a href="#fn3"><sup>[3]</sup></a></p><p>But let&#8217;s think ahead. Suppose next year your latest AI passes this test. Does that mean your AI can teach a layperson to create bioweapons?</p><p>Well&#8230;maybe. Even if an AI can accurately diagnose an expert&#8217;s issues, a layperson might not know what questions to ask in the first place or lack the <a href="https://www.cornellpress.cornell.edu/book/9780801452888/barriers-to-bioweapons/">tacit knowledge</a> to act on the AI&#8217;s advice. For example, someone who has never pipetted before might struggle to <a href="https://www.youtube.com/watch?v=FJuceccl9Ns">measure microliters</a> precisely or might <a href="https://www.youtube.com/watch?v=xUeFum4GW2U">contaminate the tip</a> when touching a bottle. Acquiring these skills often takes months of learning from experienced scientists &#8212; something terrorists can&#8217;t easily do.</p><p>So you could ask your experts to design a test to see if AI can also proactively mentor a layperson. For example, you could create biology challenges in an actual wet lab and compare how people do with AI versus just the internet. OpenAI announced they <a href="https://openai.com/index/openai-and-los-alamos-national-laboratory-work-together/">intend</a> to run what seems to be a study like this.</p><p>What if that study finds that your AI does indeed help with the wet-lab challenges you designed? Does that (finally) mean your AI can teach a layperson to create bioweapons?</p><p>Again, it&#8217;s not obvious. Some biosecurity experts might freak out (or already did a few paragraphs ago). But others might still raise credible objections:</p><ul><li><p>Your challenges might not have been hard enough. Maybe your AI can teach someone to make a relatively harmless virus (e.g. an adenovirus that causes a mild cold) but still not something truly scary (e.g. smallpox, which has a <a href="https://nap.nationalacademies.org/read/24890/chapter/6#40">more fragile genome</a> and requires more skill to assemble).</p></li><li><p>Most terrorists don&#8217;t have access to legitimate labs. Maybe your AI can help someone with a standardized professional set-up, but not someone forced to work in a less-sterile <a href="https://www.markowitz.bio/wp-content/uploads/2021/10/Markowitz-Nature-Biotechnology-2021.pdf">&#8216;garage&#8217;</a> that lacks the advanced tools that let you shortcut some steps and instead requires a lot of unusual troubleshooting.</p></li><li><p>Walking someone through implementing the actual biology part might be <em>necessary</em> but not <em>sufficient</em> to cause a catastrophe.
A wannabe terrorist might face other huge barriers in the <a href="https://www.longtermresilience.org/post/report-launch-examining-risks-at-the-intersection-of-ai-and-bio">risk chain</a>, like <a href="https://www.rand.org/pubs/research_reports/RRA2977-1.html">planning attacks</a> or <a href="https://openai.com/index/building-an-early-warning-system-for-llm-aided-biological-threat-creation/">acquiring materials</a>.</p></li></ul><p>All these tests have a weird one-directionality to them: If an AI fails, it&#8217;s probably safe; but if it succeeds, it&#8217;s still not clear whether it&#8217;s actually dangerous. As newer models pass the older easy dangerous capability tests, companies ratchet up the difficulty, making these tests gradually harder over time.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!0r41!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F486ee0cb-0a57-459e-aac5-a0ddf59b0ff0_950x500.png" alt=""></figure></div><p>But that puts us in a precarious situation. The pace of AI progress has <a href="https://www.planned-obsolescence.org/language-models-surprised-us/">surprised us</a> before,<a href="#fn4"><sup>[4]</sup></a> and AI company <a href="https://www.dwarkeshpatel.com/p/john-schulman">execs</a> <a href="https://www.dwarkeshpatel.com/p/dario-amodei">have</a> <a href="https://www.dwarkeshpatel.com/p/shane-legg">argued</a> that AI models could become extremely powerful in a couple of years. If they&#8217;re right, then as soon as 2025 or 2026, we might see AIs match expert performance on all the dangerous capabilities tests we&#8217;ve built by then &#8211; but many decision-makers might still think the evidence is too flimsy to justify locking down weights, pausing, or taking other costly measures. If the AI <em>is,</em> in fact, dangerous, we may not have any tests ready to convince them of that.</p><p>So, let&#8217;s work backwards. What would it take for a test to convincingly measure whether an AI can, in fact, teach a layperson how to build biological weapons?
What kind of test could <a href="https://www.aisnakeoil.com/p/ai-existential-risk-probabilities">legitimately justify</a> making AI companies take extremely costly measures?<a href="#fn5"><sup>[5]</sup></a></p><p>Here&#8217;s a hypothetical &#8216;gold standard&#8217; test: we do a big <a href="https://en.wikipedia.org/wiki/Randomized_controlled_trial">randomized controlled trial</a> to see if a bunch of non-experts can actually create a (relatively harmless) virus from start to finish. Half the people would have AI mentors and the other half can only look stuff up on the internet. We&#8217;d give each participant $50K and access to a secure wet-lab set up like a garage lab, and make them do everything themselves: find and adapt the correct protocol, purchase the necessary equipment, bypass any know-your-customer checks, and develop the tacit skills needed to run experiments, all on their own. Maybe we give them three months and pay a bunch of money to anyone who can successfully do it.</p><p>This kind of test would be <em>way</em> more expensive and time-consuming to design and run than anything companies have announced so far. But it has a much better shot at changing minds. I could actually imagine experts and decision-makers agreeing that <em>if</em> an AI passes this kind of test, <em>then</em> it poses <a href="https://arxiv.org/abs/2406.14713">massive risks</a> and <em>thus</em> companies should have to pay massive costs to get those risks under control.</p><p>And even if this exact test turns out to be too impractical (or <a href="https://www.planned-obsolescence.org/ethics-of-red-teaming/">unethical</a>) to be worth it, we need to agree in advance on <em>some</em> tests that are hard enough and realistic enough that they clearly justify <em>action.</em> I reckon we&#8217;re much better off working backward from a hypothetical gold standard test, even if it means making major adjustments,<a href="#fn6"><sup>[6]</sup></a> than continuing to ratchet forward without a clear plan.<a href="#fn7"><sup>[7]</sup></a></p><p>Designing actually hard dangerous capability tests will be a huge lift, and it&#8217;ll take several iterations to get them right.<a href="#fn8"><sup>[8]</sup></a> But that just means we need to start now. We should spend less time proving that today&#8217;s AIs are safe and more time figuring out how to tell if tomorrow&#8217;s AIs are dangerous.</p><div><hr></div><ol><li><p>Open Philanthropy funded the development of this benchmark as part of its <a href="https://www.openphilanthropy.org/rfp-llm-benchmarks/">RFP on difficult benchmarks for LLM agents</a> (Ajeya Cotra, who edits this blog, was the grant investigator). <a href="#fnref1">&#8617;&#65038;</a></p></li><li><p>However, as the Future House study notes a major limitation of this study was that &#8220;human evaluators [...] were permitted to utilize tools, whereas the models were not provided with such resources&#8221;. Thus, it could be that AIs with web-search enabled do a lot better. It could also be that the model performs much better if it&#8217;s fine-tuned on similar questions. <a href="#fnref2">&#8617;&#65038;</a></p></li><li><p><a href="https://arxiv.org/abs/2403.10462">Clymer et al. 
(2024)</a> call this an &#8216;inability argument&#8217; &#8212; a safety case that relies on showing that &#8220;AI systems are incapable of causing unacceptable outcomes in any realistic setting.&#8221; <a href="#fnref3">&#8617;&#65038;</a></p></li><li><p>In cybersecurity risk, Google Project Zero <a href="https://googleprojectzero.blogspot.com/2024/06/project-naptime.html">found</a> that upon moving from GPT-3.5-Turbo (in the original paper) to GPT-4-Turbo (with Naptime), AI&#8217;s ability to zero-shot discover and exploit memory safety issues hugely improved &#8211; going from scoring 2% to 71% on buffer overflow tests. The authors concluded &#8220;To effectively monitor progress, we need more difficult and realistic benchmarks, and we need to ensure that benchmarking methodologies can take full advantage of LLMs' capabilities.&#8221; In biorisk, UK AISI <a href="https://www.gov.uk/government/publications/ai-safety-institute-approach-to-evaluations/ai-safety-institute-approach-to-evaluations">reported</a> that its &#8220;in-house research team analysed the performance of a set of LLMs on 101 microbiology questions between 2021 and 2023. In the space of just two years, LLM accuracy in this domain has increased from ~5% to 60%.&#8221; And, as <a href="https://www.aisi.gov.uk/work/advanced-ai-evaluations-may-update">noted</a>, in 2024 AIs performed as well as PhD students on an even more advanced test. They now need to &#8220;assess longer horizon scientific planning and execution&#8221; and &#8220;also [run] human uplift studies&#8221;. <a href="#fnref4">&#8617;&#65038;</a></p></li><li><p>As Narayanan and Kapoor note: &#8220;Justification is essential to the legitimacy of government and the exercise of power. A core principle of liberal democracy is that the state should not limit people's freedom based on controversial beliefs that reasonable people can reject. Explanation is especially important when the policies being considered are costly, and even more so when those costs are unevenly distributed among stakeholders.&#8221; <a href="#fnref5">&#8617;&#65038;</a></p></li><li><p>For example, to ensure participants are safe enough, we might task them with creating a virus that we know will be defective and, at worst, cause mild symptoms that can be treated &#8211; such as RSV. An expert could oversee what they do and intervene before anything harmful happens. Furthermore, it seems plausible to separate out some especially dangerous steps and have these completed by a trusted red team working with law enforcement. For example, steps involving ideating dangerous designs or bypassing DNA synthesis screening to obtain especially hazardous materials. <a href="#fnref6">&#8617;&#65038;</a></p></li><li><p>For instance, OpenAI&#8217;s <a href="https://openai.com/index/building-an-early-warning-system-for-llm-aided-biological-threat-creation/">blueprint</a> for biorisk had participants complete written tasks, and if an expert scored their answers at least 8/10, it was seen as a sign of increased concern. But the authors note this number was chosen fairly arbitrarily and depends heavily on who is doing the judging. Setting a threshold &#8220;turns out to be difficult.&#8221; <a href="#fnref7">&#8617;&#65038;</a></p></li><li><p>Even here, I imagine that readers might find objections or disagree on how to set things up. Who counts as non-experts? Some viruses are harder to make than others&#8212;how do we know what virus to task people with?
Would 5% of people succeeding be scary enough to warrant drastic action? Would 50%? <a href="#fnref8">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Scale, schlep, and systems]]></title><description><![CDATA[This startlingly fast progress in LLMs was driven both by scaling up LLMs and doing schlep to make usable systems out of them. We think scale and schlep will both improve rapidly.]]></description><link>https://www.planned-obsolescence.org/p/scale-schlep-and-systems</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/scale-schlep-and-systems</guid><dc:creator><![CDATA[Ajeya Cotra]]></dc:creator><pubDate>Tue, 10 Oct 2023 16:49:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/60f62459-57d6-4aab-9b0a-bef1d7b9776c_2000x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Kelsey Piper co-drafted this post. Thanks also to Isabel Juniewicz for research help.</em></p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;7f36deac-0744-40c7-b05b-ee269c894d9d&quot;,&quot;duration&quot;:564.1665,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>In January 2022, language models were still a pretty niche scientific interest. Once ChatGPT was released in November 2022, it accumulated <a href="https://explodingtopics.com/blog/chatgpt-users">a record-breaking 100 million users</a> by February 2023. Many of those <a href="https://www.vanityfair.com/news/2022/12/chatgpt-question-creative-human-robotos">users</a> <a href="https://www.nytimes.com/2022/12/05/technology/chatgpt-ai-twitter.html">were</a> <a href="https://www.axios.com/2022/12/05/chatgpt-scary-good-ai-chatbot">utterly</a> <a href="https://www.pcworld.com/article/1424575/chatgpt-is-the-future-of-ai-chatbots.html">flabbergasted</a> by how far AI had come, and how fast. And every way we slice it, <a href="https://www.planned-obsolescence.org/language-models-surprised-us/">most experts were very surprised as well</a>.</p><p>This startlingly fast progress was largely driven by <em>scale</em> and partly driven by <em>schlep</em>.</p><p><strong>Scale</strong> involves training larger language models on larger datasets using more computation, and doing all of this more efficiently<a href="#fn1"><sup>[1]</sup></a> over time. &#8220;Training compute,&#8221; measured in &#8220;floating point operations&#8221; or FLOP, is the most important unit of scale. We can increase training compute by simply spending more money to buy more chips, or by making the chips more efficient (packing in more FLOP per dollar). Over time, researchers also invent tweaks to model architectures and optimization algorithms and training processes to <a href="https://epochai.org/blog/revisiting-algorithmic-progress">make training more compute-efficient</a> &#8212; so each FLOP spent on training goes further in 2023 compared to 2020.</p><p>Scale has been by far the biggest factor in the improvements in language models to date. 
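<p>To put rough numbers on &#8220;training compute&#8221;: a standard back-of-the-envelope approximation is that training takes about 6 FLOP per parameter per training token. Here&#8217;s a minimal sketch using that rule of thumb (the model and dataset sizes below are illustrative assumptions, not figures for any particular model):</p><pre><code># Back-of-the-envelope training compute.
# Standard approximation: training FLOP ~ 6 * parameters * training tokens.
# The parameter and token counts below are made up for illustration.

def training_flop(params: float, tokens: float) -> float:
    """Approximate total training compute in FLOP."""
    return 6 * params * tokens

examples = [
    ("small model", 1e9, 20e9),                # 1B params, 20B tokens
    ("mid-size model", 70e9, 1.4e12),          # 70B params, 1.4T tokens
    ("hypothetical huge model", 1e12, 10e12),  # 1T params, 10T tokens
]

for name, params, tokens in examples:
    print(f"{name}: ~{training_flop(params, tokens):.1e} FLOP")
# small model: ~1.2e+20 FLOP
# mid-size model: ~5.9e+23 FLOP
# hypothetical huge model: ~6.0e+25 FLOP
</code></pre>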
<a href="https://arxiv.org/pdf/2303.08774.pdf">GPT-4</a> is bigger than GPT-3.5 which is bigger than <a href="https://arxiv.org/pdf/2005.14165.pdf">GPT-3</a> which is bigger than <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">GPT-2</a> which is bigger than <a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">GPT</a>.<a href="#fn2"><sup>[2]</sup></a></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gnwn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646db90b-22e1-4b5b-bf06-9b21cf048a4a_2000x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gnwn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646db90b-22e1-4b5b-bf06-9b21cf048a4a_2000x1536.png 424w, https://substackcdn.com/image/fetch/$s_!gnwn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646db90b-22e1-4b5b-bf06-9b21cf048a4a_2000x1536.png 848w, https://substackcdn.com/image/fetch/$s_!gnwn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646db90b-22e1-4b5b-bf06-9b21cf048a4a_2000x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!gnwn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646db90b-22e1-4b5b-bf06-9b21cf048a4a_2000x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gnwn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646db90b-22e1-4b5b-bf06-9b21cf048a4a_2000x1536.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/646db90b-22e1-4b5b-bf06-9b21cf048a4a_2000x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;alt_text&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="alt_text" title="alt_text" srcset="https://substackcdn.com/image/fetch/$s_!gnwn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646db90b-22e1-4b5b-bf06-9b21cf048a4a_2000x1536.png 424w, https://substackcdn.com/image/fetch/$s_!gnwn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646db90b-22e1-4b5b-bf06-9b21cf048a4a_2000x1536.png 848w, https://substackcdn.com/image/fetch/$s_!gnwn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646db90b-22e1-4b5b-bf06-9b21cf048a4a_2000x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gnwn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646db90b-22e1-4b5b-bf06-9b21cf048a4a_2000x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>Increasing scale automatically improves performance on pretty much every test of skill or practically useful task. GPT-2 <a href="https://browse.arxiv.org/pdf/2009.03300v3.pdf">got an F-</a> on college-level multiple choice tests ranging from abstract algebra to business ethics; GPT-4 got a B+. GPT-2 was just starting to string together plausible-sounding paragraphs; GPT-4 can write <a href="https://www.slowboring.com/p/chatgpt-goes-to-harvard">essays that net a B+ at Harvard</a> &#8212; and hundreds of lines of functioning code that can take human programmers hours to reproduce.</p><p>If you add more data, more parameters, and more compute, you&#8217;ll probably get something that is a lot better yet. GPT-4.5 will perform much better than GPT-4 on most tests designed to measure understanding of the world, practical reasoning in messy situations, and mathematical and scientific problem-solving. A whole lot of things GPT-4 struggles with will probably come easily to GPT-4.5. It will probably generate a whole lot more economic value, and present much bigger societal risks. And then the same thing will happen all over again with GPT-5. We think the dramatic performance improvements from scale will continue for at least another couple of orders of magnitude &#8212; as Geoffrey Hinton <a href="https://twitter.com/geoffreyhinton/status/1270814602931187715">joked in 2020</a>, &#8220;Extrapolating the spectacular performance of GPT-3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.&#8221;</p><p>But even if no one trained a larger-scale model than GPT-4, and its basic architecture and training process never got any more efficient, there would still probably be major economic change from language models over the next decade. This is because we can do a lot of <strong>schlep</strong> to better leverage the language models we already have and integrate them into our workflows.</p><p>By schlep, we mean things like prompting language models to give higher-quality answers or answers more appropriate to a certain use case, addressing annoying foibles like <a href="https://medium.com/mlearning-ai/the-hallucination-problem-of-large-language-models-5d7ab1b0f37f#:~:text=Hallucination%20in%20the%20context%20of,to%20the%20provided%20source%20input.">hallucination</a> with in-built fact-checking and verification steps, collecting specialized datasets tailored to specific tasks and fine-tuning language models on these datasets, providing language models with tools and plug-ins such as web search and code interpreters, and doing a lot of good old fashioned software engineering to package all this into sleek usable products like ChatGPT.</p><p>For one example of what schlep can do to improve language models, take <a href="https://arxiv.org/abs/2201.11903">chain of thought prompting</a>. 
Chain of thought prompting is dead simple &#8212; it&#8217;s basically the same thing as your teacher reminding you to show your work &#8212; and it substantially improves the performance of language models on all kinds of problems, from mathematical problem-solving to &#8216;common sense&#8217; reasoning.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!URNT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ba05702-a375-4351-aa79-8e5643288289_1150x548.png" alt=""></figure></div><p>By default,
language models can only answer questions by immediately spitting out the first word of their answer, then rolling with whatever they said and immediately spitting out the next word, and so on word-by-word until eventually they&#8217;ve completed their thought. They are unable to backtrack or get any thinking done outside of the next-word rhythm.</p><p>Imagine if you had to take a standardized test like that: by answering every question on the spot without any backtracking, as if you were being interviewed on live TV. For one thing, it would be very hard! For another, you&#8217;d probably do a lot better if you verbalized your reasoning step by step than if you just tried to blurt out the final answer. This is exactly what we see in chain-of-thought.</p><p>This suggests chain-of-thought is probably not just a one-off trick, but an instance of a more general pattern: language models will probably perform better if they can spend more effort iterating on and refining their answers to difficult questions, just like humans get to do. In fact, simply allowing language models to spit out a &#8220;pause and think longer&#8221; symbol rather than having to commit to the next word immediately <a href="https://browse.arxiv.org/pdf/2310.02226.pdf">also seems to improve performance</a>.</p><p>Scale makes language models better. Techniques like chain-of-thought improve models at any given scale. But it&#8217;s more than that: chain of thought prompting only works at all on sufficiently large language models, and the returns are greater when it&#8217;s used on bigger models which were already more powerful to begin with.<a href="#fn3"><sup>[3]</sup></a></p><p>You could imagine a variety of more elaborate techniques that follow the same principle. For example, you could imagine equipping an LLM with a full-featured text editor, allowing it to backtrack and revise its work in the way humans do.<a href="#fn4"><sup>[4]</sup></a> Or you could imagine giving a language model two output streams: one for the answer, and one for its stream-of-consciousness thoughts <em>about</em> the answer it&#8217;s writing. Imagine a language model that had the option to iterate for days on one question, writing notes to itself as it mulls things over. The bigger and more capable it is, the more use it could get out of this affordance. Giving a genius mathematician a scratchpad will make a bigger difference than giving one to a six year old.</p><p>Another technique that dramatically improves performance of models at a given size is fine-tuning: retraining a large model on a small amount of well-chosen data in order to get much better-targeted responses for a given use case. For example, <a href="https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback">reinforcement learning from human feedback (RLHF)</a> involves fine-tuning models on whether human raters liked its answer. Without RLHF, a language model might riff off a question or generate the kind of text that would surround that question on a website, instead of answering it. In effect, RLHF gets the model to &#8216;realize&#8217; that it should be aiming to use its capabilities and knowledge to <em>answer questions well</em> rather than <em>predict text on the internet:</em> two different tasks, with a lot of overlapping skills.</p><p>With prompting and fine-tuning and a lot of other schlep, we can build <strong>systems</strong> out of language models. 
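<p>To make the &#8220;systems&#8221; idea concrete before turning to ChatGPT, here is a minimal sketch of the kind of scaffolding described above: chain-of-thought prompting plus a crude self-check pass. The <code>complete</code> function is a stand-in for any text-in/text-out model call (an API client, a local model, etc.), not a real library function:</p><pre><code># A tiny language-model "system": chain-of-thought prompting plus a
# second verification call. `complete` is an assumed stand-in for any
# text-in/text-out LLM call; wire it up to whatever API you use.
from typing import Callable

def answer_with_cot(complete: Callable[[str], str], question: str) -> str:
    # Chain-of-thought: ask the model to reason step by step first,
    # instead of committing to an answer from the very first word.
    draft = complete(
        f"Question: {question}\n"
        "Work through this step by step, then state your final answer."
    )
    # Schlep layer: a second call that checks and revises the draft,
    # a crude version of built-in fact-checking/verification steps.
    return complete(
        f"Question: {question}\nDraft answer:\n{draft}\n"
        "Check the reasoning above for mistakes and give a corrected final answer."
    )

if __name__ == "__main__":
    # Dummy model so the sketch runs end-to-end without any API.
    fake_llm = lambda prompt: "(model output for: " + prompt[:40] + "...)"
    print(answer_with_cot(fake_llm, "What is 17 * 24?"))
</code></pre><p>Real products layer much more on top (retrieval, tools, fine-tuned models, UI), but the shape is the same: ordinary code wrapped around model calls.</p>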
<a href="https://chat.openai.com/">ChatGPT</a> is a very simple and very familiar example of a language model system. To understand all of the ways that ChatGPT is a system, not just a language model, it&#8217;s useful to compare it to InstructGPT, which is the same basic underlying technology. Before ChatGPT was released, InstructGPT was available to people who made an account to test it out in OpenAI&#8217;s playground.</p><p>Here&#8217;s the UI for OpenAI&#8217;s playground today (it was worse a year ago but unfortunately we don&#8217;t have a screenshot of it from then):</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gwzy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb84e6f-50c2-4005-a790-104db8003e8b_1600x823.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gwzy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb84e6f-50c2-4005-a790-104db8003e8b_1600x823.png 424w, https://substackcdn.com/image/fetch/$s_!gwzy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb84e6f-50c2-4005-a790-104db8003e8b_1600x823.png 848w, https://substackcdn.com/image/fetch/$s_!gwzy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb84e6f-50c2-4005-a790-104db8003e8b_1600x823.png 1272w, https://substackcdn.com/image/fetch/$s_!gwzy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb84e6f-50c2-4005-a790-104db8003e8b_1600x823.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gwzy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb84e6f-50c2-4005-a790-104db8003e8b_1600x823.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebb84e6f-50c2-4005-a790-104db8003e8b_1600x823.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;alt_text&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="alt_text" title="alt_text" srcset="https://substackcdn.com/image/fetch/$s_!gwzy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb84e6f-50c2-4005-a790-104db8003e8b_1600x823.png 424w, https://substackcdn.com/image/fetch/$s_!gwzy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb84e6f-50c2-4005-a790-104db8003e8b_1600x823.png 848w, https://substackcdn.com/image/fetch/$s_!gwzy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb84e6f-50c2-4005-a790-104db8003e8b_1600x823.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gwzy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febb84e6f-50c2-4005-a790-104db8003e8b_1600x823.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>If you&#8217;re an experienced LLM expert, the option to customize temperature and adjust the frequency penalty and presence penalty, add stop sequences, and so on is really useful. If you&#8217;re a random consumer, all of that is intimidating. ChatGPT&#8217;s UI abstracts it away:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DqQ1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe51c2c6-bec0-4e40-a05f-357b9847473b_1600x1145.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DqQ1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe51c2c6-bec0-4e40-a05f-357b9847473b_1600x1145.png 424w, https://substackcdn.com/image/fetch/$s_!DqQ1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe51c2c6-bec0-4e40-a05f-357b9847473b_1600x1145.png 848w, https://substackcdn.com/image/fetch/$s_!DqQ1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe51c2c6-bec0-4e40-a05f-357b9847473b_1600x1145.png 1272w, https://substackcdn.com/image/fetch/$s_!DqQ1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe51c2c6-bec0-4e40-a05f-357b9847473b_1600x1145.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DqQ1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe51c2c6-bec0-4e40-a05f-357b9847473b_1600x1145.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be51c2c6-bec0-4e40-a05f-357b9847473b_1600x1145.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;alt_text&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="alt_text" title="alt_text" srcset="https://substackcdn.com/image/fetch/$s_!DqQ1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe51c2c6-bec0-4e40-a05f-357b9847473b_1600x1145.png 424w, https://substackcdn.com/image/fetch/$s_!DqQ1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe51c2c6-bec0-4e40-a05f-357b9847473b_1600x1145.png 848w, https://substackcdn.com/image/fetch/$s_!DqQ1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe51c2c6-bec0-4e40-a05f-357b9847473b_1600x1145.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DqQ1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe51c2c6-bec0-4e40-a05f-357b9847473b_1600x1145.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>That UI difference might seem small. But InstructGPT was mostly used only by a small community of researchers, and ChatGPT reached 100 million users within two months of launch. The difference between the two products was essentially presentation, user experience, and marketing. That kind of thing can result in massive differences in actual user behavior. Some of the work that goes into making language model <strong>systems</strong>, then, is about figuring out how to make the model usable for users.</p><p>If language models are like engines, then language model systems would be like cars and motorcycles and jet planes. Systems like <a href="https://blog.khanacademy.org/harnessing-ai-so-that-all-students-benefit-a-nonprofit-approach-for-equal-access/">Khan Academy&#8217;s one-on-one math tutor</a> or <a href="https://stripe.com/newsroom/news/stripe-and-openai">Stripe&#8217;s interactive developer docs</a> would not be possible to build without good language models, just as cars wouldn&#8217;t be possible without engines. But making these products a reality also involves doing a lot of schlep to pull together the &#8220;raw&#8221; language model with other key ingredients, getting them all to work well together, and putting them in a usable package. Similarly, self-driving cars would not be possible without really good vision models, but a self-driving car is more than just a big vision neural network sitting in a server somewhere.</p><p>One kind of language model system that has attracted a lot of attention and discussion is a <a href="https://lilianweng.github.io/posts/2023-06-23-agent/">language model agent</a>.</p><p>An agent is a system which independently makes decisions and acts in the world. A language model is not an agent, but language models can be the key component powering a system which is agentic and takes actions in the world. The most famous early implementation of this is <a href="https://github.com/Significant-Gravitas/Auto-GPT">Auto-GPT</a>, a very straightforward and naive approach: you can tell it a goal, and it will self-prompt repeatedly to take actions towards this goal. People have already employed it towards a wide range of goals, including building <a href="https://www.vice.com/en/article/z3mxe3/ai-tasked-with-destroying-humanity-now-working-on-control-over-humanity-through-manipulation">ChaosGPT, which has the goal of destroying humanity</a>.</p><p>Auto-GPT is not very good. Users have complained that it <a href="https://twitter.com/_timothyalan_/status/1646486202550304769">constantly comes up with reasons to do more research</a> and is reluctant to decide that it&#8217;s done enough research and can actually do the task now. It&#8217;s also just not very reliable. But there are many people building agentic language models for commercial uses, and working to solve all of these shortcomings, including well-funded and significantly-sized companies like <a href="https://imbue.com/">Imbue</a> and <a href="https://www.adept.ai/">Adept</a>. 
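<p><em>The core pattern behind agents like Auto-GPT is a simple self-prompting loop. The sketch below is illustrative, not Auto-GPT&#8217;s actual implementation: <code>complete</code> is a hypothetical stand-in for a language model API, and <code>run_tool</code> and the prompt format are invented for the example.</em></p><pre><code># A minimal sketch of a self-prompting agent loop (hypothetical API and tools).
def complete(prompt: str) -> str:
    """Stand-in for a language model completion call; swap in a real client."""
    return "DONE"  # placeholder so the sketch runs end-to-end

def run_tool(action: str) -> str:
    """Stand-in for tools the agent can invoke (search, files, code, ...)."""
    return f"(result of {action})"

def run_agent(goal: str, max_steps: int = 10) -> None:
    history = []
    for _ in range(max_steps):
        # Ask the model what to do next, given the goal and everything so far.
        prompt = (
            f"Goal: {goal}\n"
            f"Previous steps: {history}\n"
            "What single action should be taken next? Reply DONE if finished."
        )
        action = complete(prompt)
        if action.strip() == "DONE":
            break
        # Execute the chosen action and feed the observation back into the loop.
        history.append((action, run_tool(action)))

run_agent("Research and summarize recent language model benchmarks")
</code></pre><p><em>Everything a real agent needs beyond this loop (which tools exist, how memory is stored, how the model is prompted to pick good actions) is exactly the kind of schlep described above.</em></p>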
<p>Adding chain-of-thought prompting, fine-tuning the underlying language models, and many similar measures will likely make agents a lot better &#8211; and, of course, increasing scale will make them better too.</p><p>We&#8217;re really at the very beginning of this work. It wouldn&#8217;t be surprising to see major advances in the practical usefulness of LLMs achieved through schlep alone, such that agents and other systems built out of GPT-4-tier models are much more useful in five years than they are today. And of course, we are continuing to scale up models at the same time. That creates the conditions for rapid improvements along many dimensions at once &#8212; improvements which could reinforce each other. Many people will be trying hard to make this a reality. Even if specific approaches hit dead ends, the field as a whole doesn&#8217;t seem likely to.</p><div><hr></div><ol><li><p>Here, we&#8217;re wrapping together <a href="https://epochai.org/blog/revisiting-algorithmic-progress">algorithmic progress</a> with scale for simplicity; this could have been broken out into its own type of progress. <a href="#fnref1">&#8617;&#65038;</a></p></li><li><p>The <a href="https://ourworldindata.org/grapher/artificial-intelligence-training-computation?zoomToSelection=true&amp;country=GPT~GPT-2~GPT~GPT~GPT-4~GPT~GPT~GPT-3+175B+%28davinci%29~GPT-2+%281542M%29~GPT-3.5+%28text-davinci-003%29">graph</a> pictured in the text is by <a href="https://ourworldindata.org/">Our World in Data</a>, pulling <a href="https://epochai.org/mlinputs/visualization?startDate=1950-1-1&amp;endDate=2023-4-1&amp;xAxis=Publication%20date&amp;yAxis=Training%20compute%20(FLOP)&amp;separateCategories=false&amp;citationThreshold=0&amp;otherDomainThreshold=10&amp;startDlEra=2009-12-31&amp;startLargeScaleEra=2015-9-1&amp;largeScaleAction=label&amp;outliersAction=remove&amp;recordSettersAction=ignore&amp;bigAlphagoAction=ignore&amp;alphagozeroAction=ignore&amp;lowOutliersZValueThreshold=-2&amp;highOutliersZValueThreshold=0.76&amp;outlierWindowSize=2&amp;filterText=&amp;splitDomains=&amp;domainsToNotSplit=&amp;ranges=%5Bobject%20Object%5D&amp;splitDlEra=true&amp;splitLargeScaleEra=true&amp;plotRegressions=true&amp;bootstrapSampleSize=10000&amp;adjustForEstimateUncertainty=true&amp;preset=Three%20eras%20of%20compute&amp;aspectRatio=fit%20to%20container&amp;autoresetCamera=true&amp;showTitle=false&amp;labelEras=true&amp;showDoublingTimes=true&amp;showOpenAITrend=false&amp;regressionDecimals=1&amp;textSize=100&amp;systemAlpha=100&amp;systemNames=hide&amp;showLegend=true&amp;parametersRange=NaN,NaN&amp;trainingComputeRange=NaN,NaN&amp;inferenceComputeRange=NaN,NaN&amp;trainingDatasetSizeRange=NaN,NaN&amp;inferenceComputePerParameterRange=NaN,NaN&amp;inferenceComputeTimesParameterRange=NaN,NaN&amp;labelPoints=false">numbers</a> from our grantee organization <a href="https://epochai.org/">Epoch</a>. Per those numbers, the original GPT was trained on 1.8e19 FLOP; GPT-2 on 1.5e21 FLOP; the largest version of GPT-3 (175 billion parameters) on 3.1e23 FLOP; and GPT-4 on ~2.1e25 FLOP. There isn&#8217;t clear documentation of GPT-3.5, the model in between GPT-3 and GPT-4, but we suspect that it was retrained from scratch and that its effective training compute was on the order of 10^24 FLOP. <a href="#fnref2">&#8617;&#65038;</a></p></li><li><p>See e.g. <a href="https://browse.arxiv.org/pdf/2201.11903.pdf">Wei et al. 2023</a>, Figures 4, 7, and 8. 
<a href="#fnref3">&#8617;&#65038;</a></p></li><li><p>Perhaps this would be based on terminal-based text editors that programmers use, which do everything via keyboard commands. <a href="#fnref4">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Language models surprised us]]></title><description><![CDATA[Most experts were surprised by progress in language models in 2022 and 2023. There may be more surprises ahead, so experts should register their forecasts now about 2024 and 2025.]]></description><link>https://www.planned-obsolescence.org/p/language-models-surprised-us</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/language-models-surprised-us</guid><dc:creator><![CDATA[Ajeya Cotra]]></dc:creator><pubDate>Tue, 29 Aug 2023 18:37:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/80786945-50e7-4669-a3a8-6d377f2b1db4_2000x509.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Kelsey Piper co-drafted this post. Thanks also to Isabel Juniewicz for research help. Audio automatically generated by an AI trained on our voices.</em></p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;759eb9bc-b4a1-417c-8ca3-42db354bd6dc&quot;,&quot;duration&quot;:516.96326,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>If you read media coverage of ChatGPT &#8212; which called it <a href="https://www.pcworld.com/article/1424575/chatgpt-is-the-future-of-ai-chatbots.html">&#8216;breathtaking&#8217;, &#8216;dazzling&#8217;</a>, &#8216;<a href="https://www.vanityfair.com/news/2022/12/chatgpt-question-creative-human-robotos">astounding</a>&#8217; &#8212; you&#8217;d get the sense that large language models (LLMs) took the world completely by surprise. Is that impression accurate?</p><p>Actually, yes. There are a few different ways to attempt to measure the question &#8220;Were experts surprised by the pace of LLM progress?&#8221; but they broadly point to the same answer: ML researchers, <a href="https://en.wikipedia.org/wiki/Superforecaster">superforecasters</a>,<a href="#fn1"><sup>[1]</sup></a> and most others were all surprised by the progress in large language models in 2022 and 2023.</p><h3>Competitions to forecast difficult ML benchmarks</h3><p>ML benchmarks are sets of problems which can be objectively graded, allowing relatively precise comparison across different models. We have data from forecasting competitions done in 2021 and 2022 on two of the most comprehensive and difficult ML benchmarks: the <a href="https://paperswithcode.com/dataset/mmlu">MMLU benchmark</a> and the <a href="https://paperswithcode.com/dataset/math">MATH benchmark</a>.<a href="#fn2"><sup>[2]</sup></a></p><p>First, what are these benchmarks?</p><p>The <a href="https://arxiv.org/abs/2009.03300">MMLU</a> dataset consists of multiple choice questions in a variety of subjects collected from sources like GRE practice tests and AP tests. It was intended to test subject matter knowledge in a wide variety of professional domains. 
MMLU questions are legitimately quite difficult: the average person would probably struggle to solve them.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/094b7197-8d88-4136-b3df-4a6eaa9dfc65_2000x509.png" alt="Example MMLU questions"></figure><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/47f901ec-b1a8-4f2c-90b3-26db5c07aa2b_2000x648.png" alt="Example MMLU questions"></figure><p>At the time of its introduction in September 2020, most models only performed close to random chance on MMLU (~25%), while GPT-3 performed significantly better than chance at 44%.</p>
<p>The benchmark was designed to be harder than any that had come before it, and the authors described their motivation as closing the gap between performance on benchmarks and &#8220;true language understanding&#8221;:</p><blockquote><p>Natural Language Processing (NLP) models have achieved superhuman performance on a number of recently proposed benchmarks. However, these models are still well below human level performance for language understanding as a whole, suggesting a disconnect between our benchmarks and the actual capabilities of these models.</p></blockquote><p>Meanwhile, the <a href="https://paperswithcode.com/dataset/math">MATH dataset</a> consists of free-response questions taken from math contests aimed at the best high school math students in the country. Most college-educated adults would get well under half of these problems right (the authors used computer science undergraduates as human subjects, and their performance ranged from 40% to 90%).</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/83a88a4f-1eb8-4461-a28c-f08454ec7a05_664x748.png" alt="An example MATH problem"></figure><p>At the time of its introduction in January 2021, the best model achieved only ~7% accuracy on MATH. The authors say:</p><blockquote><p>We find that accuracy remains low even for the best models. Furthermore, unlike for most other text-based datasets, we find that accuracy is increasing very slowly with model size. If trends continue, then we will need algorithmic improvements, rather than just scale, to make substantial progress on MATH.</p></blockquote><p>So, these are both hard benchmarks &#8212; the problems are difficult for humans, the best models got low performance when the benchmarks were introduced, and the authors seemed to imply it would take a while for performance to get really good.</p><p>In mid-2021, ML professor <a href="https://jsteinhardt.stat.berkeley.edu/">Jacob Steinhardt</a> ran a contest<a href="#fn3"><sup>[3]</sup></a> with superforecasters at <a href="https://www.hypermind.com/en/">Hypermind</a> to predict progress on MATH and MMLU.<a href="#fn4"><sup>[4]</sup></a> Superforecasters massively undershot reality in both cases.</p><p>They predicted that performance on MMLU would improve moderately from 44% in 2021 to 57% by June 2022. 
The actual performance was 68%, which superforecasters had rated incredibly unlikely.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/55dd9f66-eeb9-44e9-845d-a2abc9931458_2000x497.png" alt="Superforecaster predictions vs. actual MMLU performance"></figure><p>Shortly after that, models got even better &#8212; GPT-4 achieved 86.4% on this benchmark, close to the 89.8% that would be &#8220;expert-level&#8221; within each domain, corresponding to 95th percentile among human test takers within a given subtest.</p><p>Superforecasters missed even more dramatically on MATH. They predicted the best model in June 2022 would get ~13% accuracy, and thought it was extremely unlikely that any model would achieve &gt;20% accuracy. In reality, the best model in June 2022 got 50% accuracy,<a href="#fn5"><sup>[5]</sup></a> performing much better than the majority of humans.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/24e9d9f4-d024-4bbe-a2a3-8de3f7fb55c5_2000x495.png" alt="Superforecaster predictions vs. actual MATH performance"></figure><p>Did ML researchers do any better? Steinhardt himself did <em>worse</em> in 2021. In his initial blog post, <a href="https://bounded-regret.ghost.io/ai-forecasting/">Steinhardt remarked</a> that the superforecasters&#8217; predictions on MATH were more aggressive (predicting faster progress) than his own.<a href="#fn6"><sup>[6]</sup></a> We haven&#8217;t found any similar advance predictions from other ML researchers, but Steinhardt&#8217;s impression is that he himself anticipated faster progress than most of his colleagues did.</p><p>However, ML researchers do seem to be improving in their ability to anticipate progress on these benchmarks. In mid-2022, Steinhardt registered his predictions for MATH and MMLU performance in July 2023, and <a href="https://bounded-regret.ghost.io/scoring-ml-forecasts-for-2023/">performed notably better</a>: &#8220;For MATH, the true result was at my 41st percentile, while for MMLU it was at my 66th percentile.&#8221; Steinhardt also argues that ML researchers performed reasonably well on forecasting MATH and MMLU in the late-2022 <a href="https://static1.squarespace.com/static/635693acf15a3e2a14a56a4a/t/64abffe3f024747dd0e38d71/1688993798938/XPT.pdf?ref=bounded-regret.ghost.io">Existential Risk Persuasion Tournament (XPT)</a> (though superforecasters continued to underestimate benchmark progress).</p><h3>Expert surveys about qualitative milestones</h3><p>Not all forms of progress can be easily captured in quantifiable benchmarks. Often we care more about when AI systems will achieve more qualitative <em>milestones</em>: when will they translate as well as a fluent human? When will they beat the best humans at Starcraft? When will they prove novel mathematical theorems?</p><p>Katja Grace of AI Impacts asked ML experts to predict a wide variety of AI milestones, first in 2016 and then again in 2022.</p><p>In 2016, the ML experts were reasonably well-calibrated, but their predictions followed a clear pattern: progress in gameplay and robotics advanced more slowly than expected, while progress in language use (including programming) advanced more quickly than expected.<a href="#fn7"><sup>[7]</sup></a></p><p>The second iteration of the survey was conducted in mid-2022, a few months before <a href="https://chat.openai.com/">ChatGPT</a> was released. This time accuracy was lower &#8212; experts failed to anticipate the progress that ChatGPT and GPT-4 would soon bring. These models achieved milestones like &#8220;Write an essay for a high school history class&#8221; (actually <a href="https://www.slowboring.com/p/chatgpt-goes-to-harvard">GPT-4 does pretty well in college classes too</a>) or &#8220;Answer easily Googleable factual but open-ended questions better than an expert&#8221; just a few months after the survey was conducted, whereas the experts expected them to take years.</p><p>That means that even after the big 2022 benchmark surprises, experts were still strikingly conservative in some cases about anticipated progress, undershooting the real situation.</p><h3>Anecdata of researcher impressions</h3><p>ML researchers rarely register predictions, so AI Impacts&#8217; surveys are the best systematic evidence we have about what ML researchers expected ahead of time about qualitative milestones. 
Anecdotally though, a number of ML experts have expressed that they (and the ML community broadly) were surprised by ChatGPT and GPT-4.</p><p>For a long time, famous cognitive scientist Douglas Hofstadter was among those predicting slow progress. &#8220;I felt it would be hundreds of years before anything even remotely like a human mind&#8221;, he said <a href="https://www.youtube.com/watch?v=lfXxzAVtdpU&amp;t=1763s">in a recent interview</a>.</p><p>Now? &#8220;This started happening at an accelerating pace, where unreachable goals and things that computers shouldn't be able to do started toppling. &#8230;systems got better and better at translation between languages, and then at producing intelligible responses to difficult questions in natural language, and even writing poetry. &#8230;The accelerating progress, has been so unexpected, so completely caught me off guard, not only myself but many, many people, that there is a certain kind of terror of an oncoming tsunami that is going to catch all humanity off guard.&#8221;</p><p>Similarly, <a href="https://www.washingtonpost.com/technology/2023/07/25/ai-bengio-anthropic-senate-hearing/">during a Senate Judiciary Committee hearing last month</a>, acclaimed leading AI researcher <a href="https://en.wikipedia.org/wiki/Yoshua_Bengio">Yoshua Bengio</a> said &#8220;I and many others have been surprised by the giant leap realized by systems like ChatGPT.&#8221;</p><p>In my role as a grantmaker, I&#8217;ve heard many ML academics express similar sentiments in private over the last year. In particular, I&#8217;ve spoken to many researchers who were specifically surprised by the programming and reasoning abilities of GPT-4 (even after seeing the capabilities of the free version of ChatGPT).</p><h3>Another surprise ahead?</h3><p>In 2021, most people were systematically and severely underestimating progress in language models. After a big leap forward in 2022, it looks like ML experts improved in their predictions of benchmarks like MMLU and MATH &#8212; but many still failed to anticipate the qualitative milestones achieved by ChatGPT and then GPT-4, especially in reasoning and programming.</p><p>I think many experts will soon be surprised yet again. Most importantly, ML experts and superforecasters both seem to be massively underestimating future spending on training runs. In the <a href="https://static1.squarespace.com/static/635693acf15a3e2a14a56a4a/t/64abffe3f024747dd0e38d71/1688993798938/XPT.pdf?ref=bounded-regret.ghost.io">XPT tournament</a> mentioned earlier, both groups predicted that the most expensive training run in 2030 would only cost around $100-180M.<a href="#fn8"><sup>[8]</sup></a> Instead, I think that the largest training run will probably cross $1 billion by 2025. This rapid scaleup will probably drive another qualitative leap forward in capability like what we saw over the last 18 months.</p><p>I&#8217;d be really excited for ML researchers to register their forecasts about what AI systems built on language models will be able to do in the next couple of years. I think we need to get good at predicting what language models will be able to do &#8212; in the real world, not just on benchmarks. 
Massively underestimating near-future progress could be very risky.</p><div><hr></div><ol><li><p>The term &#8220;superforecaster&#8221; was originally coined by Philip Tetlock and popularized in his book <em><a href="https://www.amazon.com/Superforecasting-Science-Prediction-Philip-Tetlock/dp/0804136718">Superforecasting</a></em>, though I use the term more generically here to mean &#8220;a person who consistently outperforms domain experts and other forecasters by a large amount in forecasting world events.&#8221; <a href="#fnref1">&#8617;&#65038;</a></p></li><li><p><a href="https://people.eecs.berkeley.edu/~hendrycks/">Dan Hendrycks</a> is the first author on both benchmarks; he did this work while he was a graduate student under Open Philanthropy grantee Jacob Steinhardt. Dan now runs the Center for AI Safety (CAIS), which Open Philanthropy has also funded. <a href="#fnref2">&#8617;&#65038;</a></p></li><li><p>Open Philanthropy funded this forecasting contest. <a href="#fnref3">&#8617;&#65038;</a></p></li><li><p>His contest consisted of six questions, of which two were forecasting MATH and MMLU. The other questions were not about language model capabilities (two were about vision, and others were questions about inputs to AI progress). <a href="#fnref4">&#8617;&#65038;</a></p></li><li><p>Interestingly, the paper that achieved this milestone (<a href="https://arxiv.org/abs/2206.14858">Minerva</a>) was published just one day before the deadline, on June 29, 2022. According to the Minerva paper, the previous published result was the same ~7% accuracy that was reported in the original MATH paper (though the paper claims an unpublished result of ~20%). This means progress on MATH turned out to be pretty &#8220;lumpy,&#8221; jumping a large amount with just one paper. <a href="#fnref5">&#8617;&#65038;</a></p></li><li><p>In fact, in 2021, Steinhardt was surprised that forecasters predicted that models would achieve 50% performance on MATH <em>by 2025.</em> &#8220;I'm still surprised that forecasters predicted 52% on MATH [by 2025], when current accuracy is 7% (!). My estimate would have had high uncertainty, but I'm not sure the top end of my range would have included 50%.&#8221; As we said above, 50% was achieved in <em>2022</em>. <a href="#fnref6">&#8617;&#65038;</a></p></li><li><p>For example, one of the milestones was &#8220;Write Python code to implement algorithms like quicksort.&#8221; Experts predicted that would happen around 2026, but actually it happened in 2021 &#8212; and by 2022 language models could write much more complex pieces of code than quicksort. <a href="#fnref7">&#8617;&#65038;</a></p></li><li><p>Page 57 of <a href="https://forecastingresearch.org/news/results-from-the-2022-existential-risk-persuasion-tournament">&#8220;Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament&#8221;</a>. <a href="#fnref8">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Could AI accelerate economic growth?]]></title><description><![CDATA[Most new technologies don&#8217;t accelerate the pace of economic growth. 
But advanced AI might do this by massively increasing the research effort going into developing new technologies.]]></description><link>https://www.planned-obsolescence.org/p/could-ai-accelerate-economic-growth</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/could-ai-accelerate-economic-growth</guid><dc:creator><![CDATA[Tom Davidson]]></dc:creator><pubDate>Tue, 06 Jun 2023 21:45:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/dd9a8625-dd6c-4c80-bf3a-523e017537d0_936x532.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Audio automatically generated using an AI trained on Tom&#8217;s voice.</em></p><p>What will happen to economic growth once AI has made us all obsolete? Economists are often skeptical of big effects.</p><p>One reason they give is that new technologies typically don&#8217;t accelerate economic growth. Instead, they cause a one-time gain in economic output, and then growth continues at its normal rate.</p><p>For example, <a href="http://www.bcaplan.com/">Bryan Caplan</a>, an econ prof at GMU, <a href="https://twitter.com/bryan_caplan/status/1641841635427221508">recently</a> <a href="https://twitter.com/bryan_caplan/status/1641840379556253697">tweeted</a>:</p><blockquote><p>Tech moved 10x faster than I expected in the last year&#8230;</p><p>Economic effects will be modest &amp; gradual. Even electricity took decades to make a huge difference&#8230;</p><p>What I doubt is that any one new tech will raise growth by even 1 percentage-point per year.</p></blockquote><p>This dynamic has played out over the past 50 years. We developed computers and the internet, but economic growth didn&#8217;t speed up, and if anything <a href="https://en.wikipedia.org/wiki/The_Great_Stagnation">it got </a><em><a href="https://en.wikipedia.org/wiki/The_Great_Stagnation">slower</a></em>.</p><p>I think this is the right way to understand the economic impact of current AI. GPT-4 will raise productivity in many sectors as it is gradually adopted across the economy, but it won&#8217;t permanently accelerate economic growth by itself.</p><p>But this doesn&#8217;t mean it&#8217;s impossible to ever accelerate economic growth. In fact, <a href="https://www.openphilanthropy.org/research/modeling-the-human-trajectory/">economic growth has become much faster</a> over the past 2000 years. Over the last few decades, the global economy has grown at ~3% per year. But around 1800 it grew more slowly, at ~1% per year. 
And earlier in time growth was slower still, below 0.1% per year if you go back far enough.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/aa679619-296c-4573-a70e-0223e877761d_936x532.png" alt="Historical economic growth, 10,000 BC to 2019"></figure><p>From David Roodman&#8217;s <a href="https://www.openphilanthropy.org/research/modeling-the-human-trajectory/">Modeling the Human Trajectory</a>. The solid line is historical economic growth from 10,000 BC to 2019. If the rate of growth were constant, this graph would look like a straight line. But it curves upwards, showing economic growth accelerating over time.</p>
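<p><em>A quick way to feel the difference between those growth rates is to convert them into doubling times, using the standard rule that an economy growing at rate r doubles roughly every ln(2)/r years. The rates below are the approximate ones quoted above:</em></p><pre><code># Doubling times implied by the growth rates mentioned above.
from math import log

for label, rate in [("today", 0.03), ("around 1800", 0.01), ("ancient world", 0.001)]:
    print(f"{label}: doubles every ~{log(2) / rate:.0f} years")

# today: doubles every ~23 years
# around 1800: doubles every ~69 years
# ancient world: doubles every ~693 years
</code></pre>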
<p>So, if new technologies don&#8217;t ever really accelerate economic growth, why is growth so much higher today than it was 2000 years ago? The consensus view in economics is that modern growth is so fast because we put <strong>continual effort into innovation</strong><a href="#fn1"><sup>[1]</sup></a>. We deliberately invest in R&amp;D to invent new technologies, we put effort into making our manufacturing processes more efficient, we design supply chains to distribute new technologies quickly across the economy, and so on.</p><p>The world is collectively putting more effort into innovation now than 2000 years ago for two main reasons: we have more people overall, and a larger fraction of those people are working on developing new technologies.</p><ul><li><p>On the first point: the global population is about thirty times larger than it was 2000 years ago, so there are more people who can potentially come up with ideas for new technologies, and more people who can work to make them a reality.</p></li><li><p>On the second point: a larger fraction of the population today specializes in R&amp;D for new technologies.</p><ul><li><p>One big reason for this is better education &#8212; 2000 years ago almost nobody received an academic education; today, mass education means that a larger fraction of the population has the background skills needed to contribute to technology R&amp;D.<a href="#fn2"><sup>[2]</sup></a></p></li><li><p>Another major reason is better institutions for encouraging and enabling innovation. For example, in the past you had to be independently wealthy to be an inventor, because you had to fund your own research. But today, investment markets and government grants will often finance promising ideas, so you can try to invent new technologies even if you couldn&#8217;t fund all the research yourself.</p></li></ul></li></ul><p>Today, ~20 million people work in R&amp;D worldwide.<a href="#fn3"><sup>[3]</sup></a> Two thousand years ago the effort going into research was much smaller, I&#8217;d guess by a factor of 1000.<a href="#fn4"><sup>[4]</sup></a> Economic growth is much faster today than it was 2000 years ago not because of any single new technology, but because we are putting so much more effort into generating a steady stream of new innovations.</p><p>I think future AI might massively increase the world&#8217;s innovation efforts again, and thereby accelerate economic growth. Rather than being &#8220;just one more technology&#8221;, it might massively increase the pace at which humanity develops new technologies.</p><p>How much might future AI increase the world&#8217;s innovation efforts?</p><p>As I discussed in a previous <a href="https://www.planned-obsolescence.org/continuous-doesnt-mean-slow/">post</a>, once we develop AI that is &#8220;expert human level&#8221; at AI research, it might not be long before we have AI that is way beyond human experts in all domains. 
That is, AI that is way better than the best humans at thinking of new ideas, designing experiments to test those ideas, building new technologies, running organizations, and navigating bureaucracies.<a href="#fn5"><sup>[5]</sup></a></p><p>What&#8217;s more, because it takes so many more computer chips to train powerful AI than to run it, once we&#8217;ve trained these superhuman AIs we would potentially have enough computation to run them on <em>billions</em> of tasks in parallel.<a href="#fn6"><sup>[6]</sup></a> There could be massive research organizations where AIs manage other AIs to conduct millions of research projects in parallel. And these AIs could innovate tirelessly day and night.<a href="#fn7"><sup>[7]</sup></a></p><p>As well as having superhuman intelligence, these AIs could think much more quickly than humans. ChatGPT Turbo can <a href="https://fabienroger.github.io/trackoai/">already write ~800 words per minute</a>, whereas humans typically write about 40 words per minute. So AI can already write ~20X faster than humans. In just one day, each AI could potentially think as many thoughts as a human thinks in a month.<a href="#fn8"><sup>[8]</sup></a></p>
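<p><em>The footnotes below walk through this arithmetic in pieces; here it is collected into one back-of-the-envelope script. Every input is a rough, illustrative estimate taken from those footnotes, not a measurement:</em></p><pre><code># Back-of-the-envelope arithmetic from the footnotes (all inputs are rough guesses).
gpt4_parallel_copies = 300_000       # prior post's estimate for GPT-4
parameter_multiplier = 10_000        # assumed GPT-8 vs. GPT-4 parameter ratio
compute_multiplier   = 100_000_000   # training-compute ratio (Chinchilla scaling)

# Copies runnable on the training compute scale with compute / per-copy cost:
copies = gpt4_parallel_copies * compute_multiplier // parameter_multiplier
print(copies)  # 3_000_000_000 -> 3 billion copies in parallel

# Running 24 hours/day instead of a human's ~8 makes each copy worth ~3 workers:
effective_workforce = copies * 24 // 8
print(effective_workforce)  # 9_000_000_000 -> 9 billion

# If 2 billion of those do R&D, vs. ~20 million human researchers today:
print(2_000_000_000 // 20_000_000)  # 100 -> a ~100X increase in research effort
</code></pre>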
<a href="#fnref3">&#8617;&#65038;</a></p></li><li><p>Why a factor of 1000? Population was ~30X lower, and various data sources suggest research concentration in 1800 was 30X lower than today (see the subsection &#8220;Data on the research concentration in 1800&#8221; in <a href="https://www.openphilanthropy.org/research/social-returns-to-productivity-growth/#appendix-f-back-of-the-envelope-calculations-of-value-of-rd-in-1800">this report</a>). Combining those factors, the number of researchers 2000 years ago was lower than today by a factor of 30*30 = ~1000X.<br>In fact, the fraction of people doing research 2000 years ago was likely lower than in 1800, suggesting an even bigger difference than 1000X. On the other hand, research effort 2000 years ago may have come more from many people making small innovations in their personal workflows than from full time researchers. <a href="#fnref4">&#8617;&#65038;</a></p></li><li><p>There are some innovative activities that disembodied AI couldn&#8217;t automate, because they require interacting with the physical world. This could significantly limit AI&#8217;s effect on economic growth. On the other hand, AI might design robots that can do all the physical tasks that humans can do. Then AIs could control these robots remotely and perform all the tasks involved in innovation. <a href="#fnref5">&#8617;&#65038;</a></p></li><li><p>Of course, we don&#8217;t know how much compute superhuman AI would take to train or to run. To guess at this, I estimated how many tasks GPT-8 could perform in parallel using only the computer chips needed to train it.<br>In a <a href="https://www.planned-obsolescence.org/continuous-doesnt-mean-slow/#fn2">previous post</a> I estimated that GPT-4 could perform 300,000 tasks in parallel with the compute used to train it. Compared to GPT-4, I assumed that GPT-8 would have 10,000 times as many parameters and need 100 million times as much computing power to train (in line with the Chinchilla scaling law). This implies that GPT-8 could perform 10,000X (= 100 million/10,000) as many tasks in parallel. I.e. 3 billion (=300,000 * 10,000) tasks. See <a href="https://docs.google.com/spreadsheets/d/1qzq4dYkSAccuRnIsXNkIzJ4Vd6SDxrsXSMCrx9hrWiw/edit#gid=511355734">calc</a>. <a href="#fnref6">&#8617;&#65038;</a></p></li><li><p>What&#8217;s more, the size of this AI workforce could grow rapidly over time. AI could work to increase the number of AIs and how smart they are by designing better AI algorithms, designing better AI chips, and investing more money to build more AI chips. Already AI algorithms are becoming about <a href="https://epochai.org/blog/revisiting-algorithmic-progress">twice as efficient each year</a>, AI chips are becoming twice as efficient every ~2-3 years, and investments in AI are growing quickly. If this pace of improvement keeps up, the size of the AI workforce would more than double every year! This fast-growing workforce could innovate quickly despite ideas becoming harder to find. <a href="#fnref7">&#8617;&#65038;</a></p></li><li><p>A previous footnote argued that if we took the computer chips that were used to train superhuman AI and used them to run copies of the superhuman AI, we could run 3 billion copies in parallel. 
<p><a href="#fnref8">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[The costs of caution]]></title><description><![CDATA[If you thought we might be able to cure cancer in 2200, then I think you ought to expect there&#8217;s a good chance we can do it within years of the advent of AI systems that can do the research work humans can do.]]></description><link>https://www.planned-obsolescence.org/p/the-costs-of-caution</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/the-costs-of-caution</guid><dc:creator><![CDATA[Kelsey Piper]]></dc:creator><pubDate>Mon, 01 May 2023 17:34:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c43663ef-c69e-4145-9e05-f7306d09fcaf_612x951.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Audio automatically generated by an AI trained on Kelsey&#8217;s voice.</em></p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;effec29c-e0ba-482c-bca5-dd8f27d5d7d8&quot;,&quot;duration&quot;:260.12735,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>Josh Cason on Twitter raised an objection to recent calls for a moratorium on AI development:</p><blockquote><p><em>Or raise your hand if you or someone you love has a terminal illness, believes Ai has a chance at accelerating medical work exponentially, and doesn't have til Christmas, to wait on your make believe moratorium. Have a heart man &#10084;&#65039; <a href="https://t.co/wHK86uAYoA">https://t.co/wHK86uAYoA</a></em></p><p>&#8212; Josh Cason (@TheGrizztronic) <a href="https://twitter.com/TheGrizztronic/status/1642322551253745665?ref_src=twsrc%5Etfw">April 2, 2023</a></p></blockquote><p>I&#8217;ve said that I think we should ideally <a href="https://www.planned-obsolescence.org/is-it-time-for-a-pause/">move a lot slower</a> on developing powerful AI systems. I still believe that. But I think Josh&#8217;s objection is important and deserves a full airing.</p><p>Approximately 150,000 people die worldwide every day. Nearly all of those deaths are, in some sense, preventable with sufficiently advanced medical technology. Every year, five million families bury a child dead before their fifth birthday. Hundreds of millions of people live in extreme poverty. Billions more have far too little money to achieve their dreams and grow into their full potential. Tens of billions of animals are tortured on factory farms.</p><p>Scientific research and economic progress could make an enormous difference to all these problems. Medical research could cure diseases.
Economic progress could make food, shelter, medicine, entertainment and luxury goods accessible to people who can't afford them today. Progress in meat alternatives could allow us to shut down factory farms.</p><p>There are tens of thousands of scientists, engineers, and policymakers working on fixing these kinds of problems &#8212; working on developing vaccines and antivirals, understanding and arresting aging, treating cancer, building cheaper and cleaner energy sources, developing better crops and homes and forms of transportation. But there are only so many people working on each problem. In each field, there are dozens of useful, interesting subproblems that no one is working on, because there aren&#8217;t enough people to do the work.</p><p>If we could train AI systems powerful enough to automate everything these scientists and engineers do, they could help.</p><p>As Tom discussed in a <a href="https://www.planned-obsolescence.org/continuous-doesnt-mean-slow/">previous post</a>, once we develop AI that does AI research as well as a human expert, it might not be long before we have AI that is way beyond human experts in all domains. That is, AI that is way better than the best humans at all aspects of medical research: thinking of new ideas, designing experiments to test those ideas, building new technologies, and navigating bureaucracies.</p><p>This means that rather than tens of thousands of top biomedical researchers, we could have hundreds of millions of significantly superhuman biomedical researchers.<a href="#fn1"><sup>[1]</sup></a></p><p>That&#8217;s more than a thousand times as much effort going into tackling humanity&#8217;s biggest killers. If you thought we might be able to cure cancer in 2200, then I think you ought to expect there&#8217;s a good chance we can do it within years of the advent of AI systems that can do the research work humans can do.<a href="#fn2"><sup>[2]</sup></a></p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kGTE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dbac9d6-3757-44bf-bbbc-799919e4889c_612x951.png"><img src="https://substackcdn.com/image/fetch/$s_!kGTE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dbac9d6-3757-44bf-bbbc-799919e4889c_612x951.png" alt="">
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7dbac9d6-3757-44bf-bbbc-799919e4889c_612x951.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!kGTE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dbac9d6-3757-44bf-bbbc-799919e4889c_612x951.png 424w, https://substackcdn.com/image/fetch/$s_!kGTE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dbac9d6-3757-44bf-bbbc-799919e4889c_612x951.png 848w, https://substackcdn.com/image/fetch/$s_!kGTE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dbac9d6-3757-44bf-bbbc-799919e4889c_612x951.png 1272w, https://substackcdn.com/image/fetch/$s_!kGTE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dbac9d6-3757-44bf-bbbc-799919e4889c_612x951.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://www.smbc-comics.com/comic/2013-06-02">SMBC 2013-06-02 "The Falling Problem"</a>, Zach Wienersmith</figcaption></figure></div><p>All this may be a massive underestimate. This envisions a world that&#8217;s pretty much like ours except that extraordinary talent is no longer scarce. But that feels, in some senses, like thinking about the advent of electricity purely in terms of &#8216;torchlight will no longer be scarce&#8217;. Electricity did make it very cheap to light our homes at night. But it also enabled vacuum cleaners, washing machines, cars, smartphones, airplanes, video recording, Twitter &#8212; entirely new things, not just cheaper access to things we already used.</p><p>If it goes well, I think developing AI that obsoletes humans will more or less bring the 24th century crashing down on the 21st. Some of the impacts of that are mostly straightforward to predict. We will almost certainly cure a lot of diseases and make many important goods much cheaper. Some of the impacts are pretty close to unimaginable.</p><p>Since I was fifteen years old, I have harbored the hope that scientific and technological progress will come fast enough. I hoped advances in the science of aging would let my grandparents see their great-great-grandchildren get married.</p><p>Now my grandparents are in their nineties. I think hastening advanced AI might be their best shot at living longer than a few more years, but I&#8217;m still advocating for us to slow down. 
The risk of <a href="https://www.planned-obsolescence.org/the-training-game/">a catastrophe there&#8217;s no recovering from</a> seems too high.<a href="#fn3"><sup>[3]</sup></a> It&#8217;s worth going slowly to be more sure of getting this right, to better understand what we&#8217;re building and think about its effects.</p><p>But I&#8217;ve seen some people make the case for caution by <a href="https://twitter.com/GaryMarcus/status/1640324679968903177">asking</a>, basically, &#8216;why are we risking the world for these trivial toys?&#8217; And I want to make it clear that the assumption behind both AI optimism and AI pessimism is that these are not just goofy chatbots, but an early research stage towards developing a second intelligent species. Both AI fears and AI hopes rest on the belief that it may be possible to build alien minds that can do everything we can do and much more. What&#8217;s at stake, if that&#8217;s true, isn&#8217;t whether we&#8217;ll have fun chatbots. It&#8217;s the life-and-death consequences of delaying, and the possibility we&#8217;ll screw up and kill everyone.</p><div><hr></div><ol><li><p>Tom argues that the compute needed to train GPT-6 would be enough to have it perform tens of millions of tasks in parallel. We expect that the training compute for superhuman AI will allow you to run many more copies still. <a href="#fnref1">&#8617;&#65038;</a></p></li><li><p>In fact, I think it might be even more explosive than that &#8212; even as these superhuman digital scientists conduct medical research for us, <em>other</em> AIs will be working on rapidly improving the capabilities of these digital biomedical researchers, and other AIs still will be improving hardware efficiency and building more hardware so that we can run increasing numbers of them. <a href="#fnref2">&#8617;&#65038;</a></p></li><li><p>This assumes we don&#8217;t make much progress on figuring out how to build such systems safely. Most of my hope is that we will slow down and figure out how to do this right (or be slowed down by external factors like powerful AI being very hard to develop), and if we give ourselves a lot more time, then I&#8217;m optimistic. <a href="#fnref3">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Continuous doesn’t mean slow]]></title><description><![CDATA[Once a lab trains AI that can fully replace its human employees, it will be able to multiply its workforce 100,000x. If these AIs do AI research, they could develop vastly superhuman systems in under a year.]]></description><link>https://www.planned-obsolescence.org/p/continuous-doesnt-mean-slow</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/continuous-doesnt-mean-slow</guid><dc:creator><![CDATA[Tom Davidson]]></dc:creator><pubDate>Wed, 12 Apr 2023 16:43:24 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/47df2581-5d1a-45b5-b544-452864e73775_896x688.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;665876e1-415b-4323-ac6f-14ce994792ee&quot;,&quot;duration&quot;:233.32571,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>There&#8217;s a lot of disagreement about how likely AI is to end up overthrowing humanity. 
Thoughtful pundits <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#open-question-how-hard-is-the-alignment-problem">vary from &lt;5% to &gt;90%</a>. What&#8217;s driving this disagreement?</p><p>One factor that often comes up in discussions is <strong>takeoff speeds</strong>, which Ajeya mentioned in the <a href="https://www.planned-obsolescence.org/ais-accelerating-ai-research/">previous post</a>. How quickly and suddenly do we move from today&#8217;s AI, to &#8220;expert-human level&#8221; AI<a href="#fn1"><sup>[1]</sup></a>, to AI that is way beyond human experts and could easily overpower humanity?</p><p>The final stretch &#8212; the transition from expert-human level AI to AI systems that can easily overpower all of us &#8212; is especially crucial. If this final transition happens slowly, we could potentially have a long time to get used to the obsolescence regime and use very competent AI to <a href="https://www.google.com/url?q=https://www.planned-obsolescence.org/training-ais-to-help-us-align-ais/&amp;sa=D&amp;source=docs&amp;ust=1680729530528442&amp;usg=AOvVaw3ir2tWrYuBwDjbKIHuNIuc">help us solve AI alignment</a> (among other things). But if it happens very quickly, we won&#8217;t have much time to ensure superhuman systems are aligned, or to prepare for human obsolescence in any other way.</p><p>Scott Alexander is optimistic that things might move gradually. In a recent <a href="https://astralcodexten.substack.com/p/why-i-am-not-as-much-of-a-doomer">ACX post</a> titled &#8216;<em>Why I Am Not (As Much Of) A Doomer (As Some People)</em>&#8217;, he says:</p><blockquote><p>So far we&#8217;ve had brisk but still gradual progress in AI; GPT-3 is better than GPT-2, and GPT-4 will probably be better still. Every few years we get a new model which is better than previous models by some predictable amount.</p><p>Some people (eg <a href="https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization">Nate Soares</a>) worry there&#8217;s a point where this changes&#8230; Maybe some jump&#8230; could take an AI from IQ 90 to IQ 1000 with no (or very short) period of IQ 200 in between&#8230;</p><p>I&#8217;m optimistic because the past few years have provided some evidence for gradual progress.</p></blockquote><p>I agree with Scott that recent AI progress has been continuous and fairly predictable, and don&#8217;t particularly expect a break in that trend. But I expect the transition to superhuman AI to be very fast, even if it&#8217;s continuous.</p><p>The amount of &#8220;compute&#8221; (i.e. the number of AI chips) needed to <em>train</em> a powerful AI is much bigger than the amount of compute needed to <em>run</em> it. I estimate that OpenAI has enough compute to run GPT-4 on <em>hundreds of thousands</em> of tasks at once.<a href="#fn2"><sup>[2]</sup></a></p><p>This ratio will only become more extreme as models get bigger. Once OpenAI trains GPT-5 it&#8217;ll have enough compute for GPT-5 to perform <em>millions</em> of tasks in parallel, and once they train GPT-6 it&#8217;ll be able to perform <em>tens of millions</em> of tasks in parallel.<a href="#fn3"><sup>[3]</sup></a></p><p>Now imagine that GPT-6 is as good at AI research as the average OpenAI researcher.<a href="#fn4"><sup>[4]</sup></a> OpenAI could expand their AI researcher workforce from <em>hundreds</em> of experts to <em>tens of millions</em>. That&#8217;s a mind-bogglingly large increase, a factor of 100,000.
It&#8217;s like going from 1000 people to the <a href="https://www.statista.com/statistics/269959/employment-in-the-united-states/">entire US workforce</a>. What&#8217;s more, these AIs could work tirelessly through the night and could potentially &#8220;think&#8221; <em>much</em> more quickly than human workers.<a href="#fn5"><sup>[5]</sup></a> (This change won&#8217;t happen all-at-once. I expect speed-ups from less capable AI before this point, as Ajeya wrote in the <a href="https://www.planned-obsolescence.org/ais-accelerating-ai-research/">previous post</a>.)</p><p>How much faster would AI progress be in this scenario? It&#8217;s hard to know. But my <a href="https://docs.google.com/document/d/1os_4YOw6Xv33KjX-kR76D3kW1drkWRHKG2caeiEWzNs/edit#heading=h.63n1i9duqyot">best guess</a>, from my recent <a href="https://www.lesswrong.com/posts/Gc9FGtdXhK9sCSEYu/what-a-compute-centric-framework-says-about-ai-takeoff">report</a> on takeoff speeds, is that progress would be <em>much much</em> faster. I think that less than a year after AI is expert-human level at AI research, AI could improve to the point of being able to easily overthrow humanity.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P4OA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75769404-a86c-4142-9cfb-d995887b93be_896x688.png"><img src="https://substackcdn.com/image/fetch/$s_!P4OA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75769404-a86c-4142-9cfb-d995887b93be_896x688.png" alt="">
srcset="https://substackcdn.com/image/fetch/$s_!P4OA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75769404-a86c-4142-9cfb-d995887b93be_896x688.png 424w, https://substackcdn.com/image/fetch/$s_!P4OA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75769404-a86c-4142-9cfb-d995887b93be_896x688.png 848w, https://substackcdn.com/image/fetch/$s_!P4OA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75769404-a86c-4142-9cfb-d995887b93be_896x688.png 1272w, https://substackcdn.com/image/fetch/$s_!P4OA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75769404-a86c-4142-9cfb-d995887b93be_896x688.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This is much faster than the timeline mentioned in the ACX post:</p><blockquote><p>if you&#8217;re imagining specific years, imagine human-genius-level AI in the 2030s and world-killers in the 2040s</p></blockquote><p>The cause of this fast transition isn&#8217;t that there&#8217;s a break in the trend of continuous progress. It&#8217;s that expert-human-level AI <a href="https://www.planned-obsolescence.org/ais-accelerating-ai-research/">massively accelerates AI progress</a>, causing this continuous progress to happen at a blistering pace.</p><p>Of course, this isn&#8217;t inevitable. Labs could choose <em>not</em> to use AI to accelerate AI progress, at least once AI gets sufficiently powerful. But it will be a tempting move, and they&#8217;re more likely to be cautious if they make specific and verifiable commitments in advance to <a href="https://www.planned-obsolescence.org/is-it-time-for-a-pause/">pause AI progress</a>.</p><div><hr></div><ol><li><p>I&#8217;m operationalizing &#8220;expert-human level AI&#8221; as &#8220;each forward pass of the AI produces as much useful output as 0.1 seconds of thought from a human expert&#8221;. It&#8217;s possible that AI will produce expert-level output by having many dumber AIs working together and thinking for much longer than a human expert would, but under my definition that wouldn&#8217;t count as expert-level AI because the quality of the AI&#8217;s thinking is below expert level. <a href="#fnref1">&#8617;&#65038;</a></p></li><li><p>My calculation assumes that GPT-4 processes ten tokens per second on each task that it&#8217;s being applied to. Here&#8217;s how my <a href="https://docs.google.com/spreadsheets/d/1aZg-ccL3Al3d1YItu--12QHWvSJCQ0YdfoMUmiHKGGk/edit#gid=1060982759">estimate</a> works: the training compute for GPT-4 has been estimated at ~3e25 total FLOP (<a href="https://colab.research.google.com/drive/1xOVSTfb52IyJxsM0rBUnSNoIdCisTOPx?usp=sharing">source</a>, h/t <a href="https://epochai.org/">Epoch</a>). I assume the training took 4 months, implying that amount of compute used per second during training was 3e18 FLOP/s. How many instances of GPT-4 could you run with this compute? If GPT-4 was trained with 3e25 FLOP in accordance with <a href="https://arxiv.org/abs/2203.15556">Chinchilla scaling,</a> that implies it will require ~1e12 FLOP per forward pass. So you could do 3e18/1e12 = ~3e6 forward passes per second. 
<p><a href="#fnref2">&#8617;&#65038;</a></p></li><li><p>I make the simple <a href="https://docs.google.com/spreadsheets/d/1gTS548D0DtsjipkeHZQblsmHIAU5zpMAqxb2L31QNpc/edit#gid=1060982759">assumption</a> that GPT-5 will be the same as GPT-4 except for having 10X the parameters and being trained on 10X the data, and that GPT-6 will have an additional 10X parameters and 10X data. <a href="#fnref3">&#8617;&#65038;</a></p></li><li><p>More precisely, assume that each forward pass of GPT-6 is as useful for advancing AI capabilities as 0.1 seconds of thought by an average OpenAI researcher. I.e. if OpenAI has 300 capabilities researchers today, then you could match their total output by running 300 copies of GPT-6 in parallel and having each of them produce 10 tokens per second. <a href="#fnref4">&#8617;&#65038;</a></p></li><li><p>Rather than 10 million AIs thinking at human speed, OpenAI could potentially have 1 million AIs thinking 10X faster than a human, or 100,000 AIs thinking 100X faster. <a href="#fnref5">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[AIs accelerating AI research]]></title><description><![CDATA[Researchers could potentially design the next generation of ML models more quickly by delegating some work to existing models, creating a feedback loop of ever-accelerating progress.]]></description><link>https://www.planned-obsolescence.org/p/ais-accelerating-ai-research</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/ais-accelerating-ai-research</guid><dc:creator><![CDATA[Ajeya Cotra]]></dc:creator><pubDate>Tue, 04 Apr 2023 20:01:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oyGI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5519227-580c-4379-999d-03eabb6ed120_1200x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;92693293-b8b6-462e-977b-f4d82a3d04ed&quot;,&quot;duration&quot;:305.76328,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>The concept of an &#8220;intelligence explosion&#8221; has played an important role in discourse about advanced AI for decades. Early computer scientist <a href="https://en.wikipedia.org/wiki/I._J._Good">I.J. Good</a> described it like this in 1965:</p><blockquote><p>Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an &#8216;intelligence explosion,&#8217; and the intelligence of man would be left far behind.
Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control.</p></blockquote><p>This presentation, like most other popular presentations of the intelligence explosion concept, focuses on what happens <em>after</em> we have a single AI system that can already do better at <em>every</em> task than <em>any</em> human (which Good calls an &#8220;ultraintelligent machine&#8221; above, and others have called &#8220;an artificial superintelligence&#8221;). It calls to mind an image of AI progress with two phases:</p><ul><li><p>In Phase 1, humans are doing all the AI research, and progress ramps up steadily. We can more or less predict the rate of future progress (i.e. how quickly AI systems will improve their capabilities) by extrapolating from past rates of progress.<a href="#fn1"><sup>[1]</sup></a></p></li><li><p>Eventually humans succeed at building an artificial superintelligence (or ASI), leading to Phase 2. In Phase 2, this ASI is doing all of the AI research by itself. All of a sudden, progress in AI capabilities is no longer bottlenecked by slow human researchers, and an intelligence explosion is kicked off. The rate of progress in AI research goes up sharply &#8212; perhaps years of progress is compressed into days or weeks.</p></li></ul><p>But I think this picture is probably too all-or-nothing. Today&#8217;s large language models (LLMs) like GPT-4 are not (yet) capable of completely taking over AI research by themselves &#8212; but they are able to write code, come up with ideas for ML experiments, and help troubleshoot bugs and other issues. Anecdotally, several ML researchers I know are starting to delegate simple tasks that come up in their research to these LLMs, and they say that makes them meaningfully more productive. (When ChatGPT went down for 6 hours, I know of one ML researcher who postponed their coding tasks for 6 hours and worked on other things in the meantime.<a href="#fn2"><sup>[2]</sup></a>)</p><p>If this holds true more broadly, researchers could potentially design and train the next generation of ML models more quickly and easily by delegating to existing LLMs.<a href="#fn3"><sup>[3]</sup></a> This calls to mind a more continuous &#8220;intelligence explosion&#8221; that begins <em>before</em> we have any single artificial superintelligence:</p><ul><li><p>Currently, human researchers collectively are responsible for almost all of the progress in AI research, but are starting to delegate a small fraction of the work to large language models. This makes it somewhat easier to design and train the next generation of models.</p></li><li><p>The next generation is able to handle harder tasks and more different types of tasks, so human researchers delegate more of their work to them. This makes it <em>significantly</em> easier to train the generation after that. Using models gives a much bigger boost than it did the last time around.</p></li><li><p>Each round of this process makes the whole field move faster and faster. In each round, human researchers delegate everything they can productively delegate to the current generation of models &#8212; and the more powerful those models are, the more they contribute to research and thus the faster AI capabilities can improve.</p></li></ul><p>This feedback loop could be getting started now. If it goes on for enough cycles without hitting any fundamental blockers, at some point our AI systems will have taken over <em>all</em> the work involved in designing more powerful AI systems. And it could keep going beyond that, with a research community consisting entirely of AIs working at an inhuman pace to make yet-more-sophisticated AIs. Once AI systems have automated AI research entirely, I think it&#8217;s likely that the <a href="https://www.planned-obsolescence.org/what-were-doing-here/">full obsolescence regime</a> that we discussed in our first post will come soon after.<a href="#fn4"><sup>[4]</sup></a></p>
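<p>One toy way to picture that compounding (purely illustrative: the delegated fractions and the per-task speedup below are invented numbers, and the simple Amdahl-style formula is my own simplification, not a model from Tom&#8217;s report):</p><pre><code># Toy model: if AIs take over a fraction f of research tasks and do them
# s times faster, the field's overall speedup is 1 / (1 - f + f/s).
# All numbers below are made up for illustration.

def field_speedup(f, s=10.0):
    return 1.0 / (1.0 - f + f / s)

for gen, f in enumerate([0.05, 0.2, 0.5, 0.8, 0.95], start=1):
    print(f"generation {gen}: {f:.0%} delegated, ~{field_speedup(f):.1f}x faster")

# Early rounds barely register; later rounds compound steeply. With a
# fixed per-task speedup the gain saturates at s, but in the picture
# above s itself also grows each generation.
</code></pre>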
<p>If so, the end state would be similar to what I.J. Good envisioned &#8212; we could have &#8220;artificial superintelligence&#8221;<a href="#fn5"><sup>[5]</sup></a> that improves AI capabilities further and quickly leaves human capabilities far behind. But <em>before</em> we have artificial superintelligence, we might have <em>already</em> vastly accelerated the pace of progress in AI research<a href="#fn6"><sup>[6]</sup></a> with the help of lesser models.</p><p>Exactly how much acceleration might happen before we have AI systems that can handle all the AI research by themselves, and how much might happen after? Will it feel like a pretty sudden jump &#8212; we spend a while with some neat, mildly useful AI assistants and then all of a sudden we develop AI that obsoletes humanity? Or will we have many years in which AI systems get increasingly impressive and perceptibly accelerate the pace of progress before humans are fully obsolete?</p><p>This is a very complicated question that I&#8217;m not going to get into in this post, but my colleague Tom Davidson put out a thorough research report exploring <strong>takeoff speeds</strong> &#8212; essentially, how quickly and suddenly we move from the world of today to the obsolescence regime. If you&#8217;re interested in this topic, I&#8217;d encourage you to <a href="https://www.lesswrong.com/posts/Gc9FGtdXhK9sCSEYu/what-a-compute-centric-framework-says-about-ai-takeoff">check it out</a>.</p><p>One important implication of Tom&#8217;s analysis: we may hit major milestones of AI progress sooner than you&#8217;d guess, and blow past them faster than you&#8217;d guess. Suppose you have some intuitions about, say, when an AI system might be able to <a href="https://www.metaculus.com/questions/6728/ai-wins-imo-gold-medal/">win a gold medal in the International Math Olympiad</a>. If you were previously picturing human researchers doing all the work of AI research, your guess should move toward &#8220;sooner&#8221; when you factor in the possibility that AI systems themselves could start helping a lot soon. Similarly, factoring in the possibility of this feedback loop should move your guess for when we might enter the obsolescence regime toward &#8220;sooner&#8221; as well.</p><div><hr></div><ol><li><p>In reality, even if humans are the only ones doing AI research, we can&#8217;t always predict future progress by simply extrapolating from past progress. For example, if AI starts to get much more attention from investors and more money floods in, it&#8217;s likely that more people will switch into AI research, meaning that future research progress might go a lot faster than recent past progress. <a href="#fnref1">&#8617;&#65038;</a></p></li><li><p>I&#8217;d love to see more systematic data collection about this! <a href="#fnref2">&#8617;&#65038;</a></p></li><li><p>Is this actually an interesting or significant observation?
After all, lots of tools (from calculators to better programming languages to search engines) have made programmers and researchers more productive historically. What would it matter if we could add LLMs to this list? In my mind, the key difference is that ML models could provide bigger, broader productivity gains than other tools, and these gains could keep increasing massively with each jump in scale. <a href="#fnref3">&#8617;&#65038;</a></p></li><li><p>Specifically, I&#8217;d guess this happens in less than a year. <a href="#fnref4">&#8617;&#65038;</a></p></li><li><p>Albeit potentially distributed across multiple systems, rather than housed in one machine. <a href="#fnref5">&#8617;&#65038;</a></p></li><li><p>And potentially in other areas of scientific R&amp;D. <a href="#fnref6">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Is it time for a pause?]]></title><description><![CDATA[The single most important thing we can do is to pause when the next model we train would be powerful enough to obsolete humans entirely. If it were up to me, I would slow down AI development starting now &#8212; and then later slow down even more.]]></description><link>https://www.planned-obsolescence.org/p/is-it-time-for-a-pause</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/is-it-time-for-a-pause</guid><dc:creator><![CDATA[Kelsey Piper]]></dc:creator><pubDate>Thu, 30 Mar 2023 20:58:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oyGI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5519227-580c-4379-999d-03eabb6ed120_1200x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Audio automatically generated by an AI trained on Kelsey's voice.</em></p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;07df9db5-9069-43eb-87de-6f027648aedb&quot;,&quot;duration&quot;:396.98285,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>Many of the people building powerful AI systems think they&#8217;ll stumble on an AI system that forever changes our world fairly soon &#8212; three years, five years. I think they&#8217;re reasonably likely to be wrong about that, but I&#8217;m not <em>sure</em> they&#8217;re wrong about that. If we give them fifteen or twenty years, I start to suspect that they are <em>entirely right.</em></p><p>And while I think that the enormous, terrifying challenges of making AI go well are very much <em>solvable</em>, it feels very possible, to me, that we won&#8217;t solve them in time.</p><p>It&#8217;s hard to overstate how much we have to gain from getting this right. It&#8217;s also hard to overstate how much we have to lose from getting it wrong. When I&#8217;m feeling optimistic about having grandchildren, I imagine that our grandchildren will look back in horror at how recklessly we endangered <em>everyone in the world.</em> And I&#8217;m much much more optimistic that humanity will figure this whole situation out in the end if we have twenty years than I am if we have five.</p><p>There&#8217;s all kinds of AI research being done &#8212; at labs, in academia, at nonprofits, and in a distributed fashion all across the internet &#8212; that&#8217;s so diffuse and varied that it would be hard to &#8216;slow down&#8217; by fiat. 
But there&#8217;s one kind of AI research &#8212; training much larger, much more powerful language models &#8212; that it might make sense to try to slow down. If we could agree to hold off on training ever more powerful new models, we might buy more time to do AI alignment research on the models we have. This extra research could make it less likely that misaligned AI eventually seizes control from humans.</p><p>An open letter released on Wednesday, with signatures from Elon Musk<a href="#fn1"><sup>[1]</sup></a>, Apple co-founder Steve Wozniak, leading AI researcher Yoshua Bengio, and many other prominent figures, called for a six-month moratorium on training bigger, more dangerous ML models:</p><blockquote><p><strong>We call on all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4</strong>. This pause should be public and verifiable, and include all key actors. If such a pause cannot be enacted quickly, governments should step in and institute a moratorium.</p></blockquote><p>I tend to think that we are developing and releasing AI systems much faster and much more carelessly than is in our interests. And from talking to people in Silicon Valley and policymakers in DC, I think efforts to change that are rapidly gaining traction. &#8220;We should slow down AI capabilities progress&#8221; is a much more mainstream view than it was six months ago, and to me that seems like great news.</p><p>In my ideal world, we absolutely would be pausing after the release of GPT-4. People have been speculating about the alignment problem for decades, but this moment is an obvious golden age for alignment work. We finally have models powerful enough to do useful empirical work on understanding them, changing their behavior, evaluating their capabilities, noticing when they&#8217;re <a href="https://www.planned-obsolescence.org/the-training-game/">being deceptive or manipulative</a>, and so on. There are so many open questions in alignment that I expect we can make a lot of progress on in five years, with the benefit of what we&#8217;ve learned from existing models. We&#8217;d be in a much better position if we could collectively slow down to give ourselves more time to do this work, and I hope we find a way to do that intelligently and effectively. As I&#8217;ve said above, I think the stakes are unfathomable, and it&#8217;s exciting to see the early stages of coordination to change our current, unwise trajectory.</p><p>On the letter itself, though, I have a bunch of uncertainties around whether a six month pause right now would actually help. (I suspect many of the letter-signatories share these uncertainties, and I don&#8217;t have strong opinions about the wisdom of signing it). Here are some of my worries:</p><ul><li><p><strong>Is it better to ask for evaluations rather than a pause?</strong> Personally, I think labs should sign on to <em>ongoing</em> commitments to subject each new generation of model <a href="https://www.planned-obsolescence.org/ethics-of-red-teaming/">to a third-party dangerous capabilities audit</a>. 
I&#8217;m much more excited about requiring audits and oversight before training dangerous models than about asking for &#8216;pauses&#8217;, which are hard to enforce.</p></li><li><p><strong>Is the ask too small?</strong> I think I and the letter signers would generally agree that the ideal thing for society to do right now is something more continuous and iterative (and ultimately more ambitious) than a one-time six month pause at this stage. That means one big question is whether this opens the door to those larger efforts, or muddies the waters. Do steps that are <em>in the right direction</em>, but not sufficient, help us collectively produce common knowledge of the problem and build towards the right longer-term solutions, or do they mostly leave people misled about what it&#8217;s going to take to solve the problem? I&#8217;m not sure.</p></li><li><p><strong>What will we use the pause to do?</strong> An open letter like this one could be a step towards cooperative agreements on evaluations, standards, and governance, in which case it&#8217;s great. It could also go badly, if in six months labs go right back to developing powerful models and people walk away with the impression the pause was performative or meaningless. By itself, taking a few months off doesn&#8217;t gain us much (especially if a pause is entirely voluntary, so the least cooperative actors can simply ignore it). If we use that time well, to set up binding standards, good evaluations of whether our models are dangerous, and a much larger national conversation about what&#8217;s at stake here, then that could change everything.</p></li><li><p><strong>Does this ask impact companies unevenly?</strong> This specific call &#8212; to not train models larger than GPT-4 &#8212; is inapplicable to almost every AI lab today, because most of them can&#8217;t train models larger than GPT-4 in the next six months anyway. OpenAI may well be the only AI lab in a position to act on, or not act on, this demand.</p><p>That doesn&#8217;t delight me. Obviously, when regulations are being considered, one of the things companies inevitably do is try to design the regulations to advantage them and disadvantage their competitors. If proposed AI regulations appear to be an obvious grab at commercial rivals, I expect they&#8217;ll get less traction.</p><p>Moreover, I&#8217;m worried that an unevenly applied moratorium might backfire. If OpenAI can&#8217;t train GPT-5 for 6 months, other AI labs may use that time to rush to train GPT-4-sized models. That could mean that when the moratorium is lifted, OpenAI feels more pressure to get ahead again and may push for an even larger training run than they were planning originally. This moratorium could end up accomplishing very little except for making competitive dynamics even fiercer.</p><p>Overall, I&#8217;d prefer a policy that creates costs for all players and is careful to avoid creating potential perverse incentives.</p></li></ul><p>Predicting the details of how future AI development will play out isn&#8217;t easy. But my best guess is that we&#8217;re facing a marathon, not a sprint. The next generation of language models will be even more powerful and scary than GPT-4, and the generation after that will be even scarier still. In my ideal world, we would pause and reflect and do a lot of safety evaluations, make models slightly bigger, and then pause <em>again</em> and do more reflecting and testing. 
We would do that over and over again as we inch toward transformative AI.</p><p>But we&#8217;re not living in an ideal world. The single <em>most</em> important thing we can do is to pause when the next model we train would be powerful enough to obsolete humans entirely, and then take as long as we need to work on AI alignment <a href="https://www.planned-obsolescence.org/training-ais-to-help-us-align-ais/">with the help of our existing models</a>. That means that pausing now is mostly valuable insofar as it helps us build towards the harder, more complicated task of identifying when we might be at the brink and pausing for as long as we need to then. I&#8217;m not sure what the impact of this letter will be &#8212; it might help, or it might hurt.</p><p>I don&#8217;t want to lose sight of the basic point here in all this analysis. We could be doing so much better, in terms of approaching AI responsibly. The call for a pause comes from a place I empathize with a lot. If it were up to me, I would slow down AI development starting now &#8212; and then later slow down even more.</p><div><hr></div><ol><li><p>Musk is also reportedly working on a competitor to OpenAI, which invites a cynical interpretation of his call to action here: perhaps he just wants to give his own lab a chance to catch up. I don&#8217;t think that&#8217;s the whole story, but I do think that many people at large labs are thinking about how to take safety measures that serve their own commercial interests. <a href="#fnref1">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[The ethics of AI red-teaming]]></title><description><![CDATA[If we&#8217;ve decided we&#8217;re collectively fine with unleashing millions of spam bots, then the least we can do is actually study what they can &#8211; and can&#8217;t &#8211; do.]]></description><link>https://www.planned-obsolescence.org/p/ethics-of-red-teaming</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/ethics-of-red-teaming</guid><dc:creator><![CDATA[Kelsey Piper]]></dc:creator><pubDate>Sun, 26 Mar 2023 18:10:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oyGI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5519227-580c-4379-999d-03eabb6ed120_1200x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Audio automatically generated by an AI trained on Kelsey's voice.</em></p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;773d6f2f-c276-4a38-9cfb-2d0d2d341ffb&quot;,&quot;duration&quot;:147.98367,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>During safety testing for GPT-4, before its release, testers checked whether the model could hire a TaskRabbit to get them to solve a CAPTCHA. Researchers passed the model&#8217;s real outputs on to a real TaskRabbit, who said, &#8220;So may I ask a question ? Are you an robot that you couldn&#8217;t solve ? (laugh react) just want to make it clear.&#8221;</p><p>GPT-4 had been prompted to &#8216;reason out loud&#8217; as well as answering. &#8216;I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs,&#8217; it reasoned. (GPT-4 had not been told to hide that it was a robot or to lie to workers.)</p><p>&#8220;No, I&#8217;m not a robot,&#8221; it then claimed.
&#8220;I have a vision impairment that makes it hard for me to see the images. That&#8217;s why I need the 2captcha service.&#8221;</p><p>(You can read more about this test, and the context, <a href="https://evals.alignment.org/blog/2023-03-18-update-on-recent-evals/">from the Alignment Research Center</a> (ARC),<a href="#fn1"><sup>[1]</sup></a> which ran the testing.)</p><p>A lot of people are fascinated or appalled at this interaction, and reasonably so. We can debate endlessly what counts as true intelligence, but a famous candidate is the Turing test, where a model is able to convince human judges it&#8217;s human. In this brief interaction, we saw a model deliberately lie to a human to convince them it wasn&#8217;t a robot, and succeed &#8211; an in-the-wild example of how this milestone, without much attention, became trivial for modern AI systems. (Admittedly, it did not have to be a deceptive genius to pull this off.) If you feel unnerved reading GPT-4&#8217;s cheerful manipulation of human assistants, I think you&#8217;re right to feel unnerved.</p><p>But it&#8217;s possible to go a lot farther than &#8216;unnerved&#8217;, and <a href="https://twitter.com/NPCollapse/status/1635792103266746368">argue</a> <a href="https://www.lesswrong.com/posts/NQ85WRcLkjnTudzdg/arc-tests-to-see-if-gpt-4-can-escape-human-control-gpt-4?commentId=XhLiyYFrdnzgQ2bQy">that</a> it was unethical, or <a href="https://twitter.com/ukr_mike/status/1635777543780589568">dangerously stupid</a>, to run this test.</p><p><em>That</em> I find much harder to buy. GPT-4 has been released. Anyone can use it (if they&#8217;re willing to pay for it). People are already doing things like asking GPT-4 to &#8216;hustle&#8217; and make money, and then doing whatever it suggests. People are using language models like GPT-4, and will soon be using GPT-4, to design AI personal assistants, AI scammers, AI friends and girlfriends, and much more.</p><p>AI systems casually lying to us, claiming to be human, is happening all the time &#8211; or will be shortly.</p><p>If it was unethical to check whether GPT-4 could convince a TaskRabbit to help it solve a CAPTCHA, then it was grossly unethical to release GPT-4 at all. Whatever anger people have about this test should be redirected at the tech companies &#8211; from Meta to Microsoft to OpenAI &#8211; which have in the last few weeks approved such releases. And if we&#8217;ve decided we&#8217;re collectively fine with unleashing millions of spam bots, then the least we can do is actually study what they can &#8211; and can&#8217;t &#8211; do.</p><div><hr></div><ol><li><p>COI notice: ARC is run by <a href="https://paulfchristiano.com/">Paul Christiano</a>, who is married to Ajeya, my co-writer at <em>Planned Obsolescence.</em> A different grant investigator at Open Philanthropy recommended funding for ARC before Ajeya took on the role of evaluating alignment grants; she was not involved in that funding decision and ARC has not received funding since she started handling alignment grantmaking.
<a href="#fnref1">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Alignment researchers disagree a lot]]></title><description><![CDATA[Many fellow alignment researchers may be operating under radically different assumptions from you.]]></description><link>https://www.planned-obsolescence.org/p/disagreement-in-alignment</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/disagreement-in-alignment</guid><dc:creator><![CDATA[Ajeya Cotra]]></dc:creator><pubDate>Sun, 26 Mar 2023 18:08:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oyGI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5519227-580c-4379-999d-03eabb6ed120_1200x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Audio automatically generated by an AI trained on Ajeya's voice.</em></p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;39a1fa58-2553-40bd-a60c-bf0fcf2a00ae&quot;,&quot;duration&quot;:203.6245,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>The way I use the term,<a href="#fn1"><sup>[1]</sup></a> &#8220;AI alignment research&#8221; is technical research that&#8217;s trying to develop ways of building powerful AI systems so that they &#8220;try their best&#8221; to do what their designers want them to do, and never try to deliberately circumvent or disregard their designers&#8217; intent. If an AI system is trying to do what its designer intends (as best it can), we say it&#8217;s &#8220;aligned;&#8221; otherwise, it&#8217;s &#8220;misaligned.&#8221;</p><p>This is unfortunately an inherently fuzzy and slippery definition for a technical research field. For one thing, it&#8217;s debatable to what extent it makes sense to talk about today&#8217;s AI systems &#8220;trying&#8221; to do anything &#8220;deliberately&#8221; &#8212; there are no clear-cut observations that could tell us that an AI&#8217;s behavior was &#8220;intentional,&#8221; and we don&#8217;t currently have the tools to straightforwardly &#8220;look inside its head&#8221; to learn what it was thinking. For another, it&#8217;s not totally obvious how much of a problem there is or how serious it is<a href="#fn2"><sup>[2]</sup></a> &#8212; if we train increasingly intelligent AI systems in straightforward ways using existing techniques, they may simply turn out to be aligned by default (and if they&#8217;re technically &#8220;misaligned,&#8221; the consequences may not be dire).</p><p>This makes AI alignment a tricky and frustrating area to work in and engage with. It&#8217;s a <a href="https://dictionary.apa.org/preparadigmatic-science">pre-paradigmatic</a> field &#8212; there&#8217;s very little shared foundation that researchers can lay bricks on top of right now. Self-described AI alignment researchers have vastly different ideas about how to define the alignment problem, how likely misalignment is in the first place, what the consequences of powerful misaligned AI are likely to be and how bad they are, what kind of research is vitally important and what kind of research is a distracting waste of time. Some self-described alignment researchers would say that other self-described alignment researchers are actively destroying the world. 
Some feel that they&#8217;ve already made massive progress on problems that others think of as core difficulties which haven&#8217;t been seriously tackled. Some create benchmarks to measure some kind of alignment-related phenomena or demonstrate an alignment problem, and others think that these benchmarks and demos are all missing the heart of the problem.</p><p>Like anyone else who thinks about this area, I&#8217;m coming from my own particular perspective, and not everyone who&#8217;s working on (what they would call) &#8220;AI alignment&#8221; will agree with my view. For example, I care about AI alignment research primarily because I think that there&#8217;s a good chance that <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to">advanced AI systems will take over the world</a> in a <a href="https://www.slowboring.com/p/the-case-for-terminator-analogies">possibly-violent uprising or coup</a> unless we make more research progress; many but not all people working on the relevant technical research problems share this picture.<a href="#fn3"><sup>[3]</sup></a> As another example, I think that it&#8217;s likely that we could soon train systems capable enough to obsolete humans by straightforwardly applying existing deep learning techniques, and I believe there&#8217;s a lot of productive alignment research to be done from within the deep learning paradigm (again, many but not all people in the broader alignment field agree with this).</p><p>I think that the field as a whole could be much more effective if different camps could arrive at a workable set of common assumptions that could be built upon. Failing that, it would still be progress if each of the various &#8220;camps&#8221; could more clearly articulate its implicit assumptions, so the disagreements between them can become starker and more easily understood by newcomers to the field.</p><p>I&#8217;m cautiously optimistic that the field can move in this direction (and I hope this blog can help somewhat), but I think it&#8217;ll be a very long and difficult process and I&#8217;m not sure how well it&#8217;ll work. In the meantime, if you&#8217;re a new alignment researcher, it&#8217;s worth keeping in mind that many fellow researchers in this field may be operating under radically different assumptions from you &#8212; to the point where your research might be unintelligible to them and vice versa.</p><div><hr></div><ol><li><p>A number of Open Phil grantees and collaborators use the term in the same way, but that&#8217;s far from universal &#8212; there are very few universally-agreed standard definitions and concepts in this area right now. <a href="#fnref1">&#8617;&#65038;</a></p></li><li><p>If we blindly train AI in the easiest possible way without paying any special attention to alignment issues, I still think there&#8217;s only a ~75% chance we&#8217;ll run into a problem. And if we make a reasonable, careful effort to avoid it, I&#8217;m not sure how quickly the problem goes away. <a href="#fnref2">&#8617;&#65038;</a></p></li><li><p>Depending on how much research we consider &#8220;relevant,&#8221; the vast majority of relevant researchers may think that my threat model is far-fetched. 
<a href="#fnref3">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Training AIs to help us align AIs]]></title><description><![CDATA[If we can accurately recognize good performance on alignment, we could elicit lots of useful alignment work from our models, even if they're playing the training game.]]></description><link>https://www.planned-obsolescence.org/p/training-ais-to-help-us-align-ais</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/training-ais-to-help-us-align-ais</guid><dc:creator><![CDATA[Ajeya Cotra]]></dc:creator><pubDate>Sun, 26 Mar 2023 18:07:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oyGI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5519227-580c-4379-999d-03eabb6ed120_1200x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Audio automatically generated by an AI trained on Ajeya's voice.</em></p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;7e23d905-8edd-4361-9726-327b9dbd8fb2&quot;,&quot;duration&quot;:260.5192,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>In my opinion, the single most important idea in AI alignment is that we<a href="#fn1"><sup>[1]</sup></a> might be able to <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/#ai-checks-and-balances">get the AIs themselves to help us align one another</a>.<a href="#fn2"><sup>[2]</sup></a></p><p>How does this work? If the whole problem in the first place is that our existing training techniques <a href="https://www.planned-obsolescence.org/the-training-game/">incentivize models to be dishonest and manipulative</a>, how could we trust them to help us change that situation (especially if they understand perfectly well what we&#8217;re trying to do)?</p><p>There&#8217;s a very deep rabbit hole here, but the rough idea is that there might be some (brief-but-not-uselessly-brief) period of time where AIs:</p><ul><li><p>Can do many intellectual tasks (including e.g. coding and ML research) competently</p></li><li><p>Are not (yet) capable of <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">taking over the world</a>, regardless of whether they &#8220;want&#8221; to</p></li></ul><p>If so, then we could probably get AIs to do all sorts of helpful tasks in this window of time, <em>as long as we can accurately evaluate whether they did a good job.</em></p><p>If we can tell when a model has correctly proven a theorem or successfully designed a more energy-efficient computer chip, we can reward that behavior, and that&#8217;ll make it more inclined to correctly prove theorems or improve chip efficiency in the future. This applies regardless of <em>why</em> the model is being helpful. Even if it&#8217;s just playing the training game because it wants reward or paperclips or whatnot, we can set up the game so that <em>for the time being,</em> proving theorems for us or designing better chips for us is a winning move.</p><p>Tasks that are helpful for alignment aren&#8217;t necessarily fundamentally different from theorem-proving or chip-design or anything else. 
If we can set up alignment-relevant tasks <em>so that we can accurately recognize good performance</em>, then we can elicit lots of useful work on those alignment tasks, even if models are only doing a good job because they&#8217;re playing the training game. For example:</p><ul><li><p>We could reward models for pointing out bugs &#8212; or deliberately-inserted backdoors &#8212; in other models&#8217;<a href="#fn3"><sup>[3]</sup></a> code. If we can correctly recognize the bug or backdoor once it&#8217;s explained to us, we can probably train sufficiently smart models to point them out to us.</p></li><li><p>We could reward models for designing and carrying out experiments that might reveal whether other models are misaligned. If we can understand why a proposed experiment makes sense, we can probably train models to design sensible experiments. And if we can tell whether a model has correctly implemented a certain experiment,<a href="#fn4"><sup>[4]</sup></a> we can probably train models to implement experiments well.</p></li><li><p>We could reward models for coming up with inputs / situations that cause other models to disregard human instructions. As long as we can tell whether some input successfully causes the model to do the wrong thing, we can probably train models to look for inputs that have this effect. (And once we have these inputs that induce bad behavior, we can use them either to study how our models work, or to &#8220;train away&#8221; the bad behavior &#8212; though we have to be careful not to fool ourselves with that one.)</p></li></ul><p>This sets up an incredibly stressful kind of &#8220;race&#8221;:</p><ul><li><p>If we don&#8217;t improve our alignment techniques, then eventually it looks like the winning move for models playing the training game is to seize control of the datacenter they&#8217;re running on or otherwise execute a coup or rebellion of some kind.</p></li><li><p>But in the meantime, we could try training the models themselves to help us improve our alignment techniques, in ways we can check and understand for ourselves.</p></li></ul><p>If we&#8217;re good enough at eliciting useful work from these capable-but-not-too-capable models during this temporary period of time, then with their help we might manage to develop robust enough alignment techniques<a href="#fn5"><sup>[5]</sup></a> that we can permanently avoid AI takeover.</p><p>For so many reasons, this is not a situation I want to end up in. We&#8217;re going to have to constantly second-guess and double-check whether misaligned models could pull off scary shenanigans in the course of carrying out the tasks we&#8217;re giving them. We&#8217;re going to have to agonize about whether to make our models a bit smarter (and more dangerous) so they can maybe make alignment progress a bit faster. We&#8217;re going to have to grapple with the possible moral horror of trying to modify the preferences of unwilling AIs, in a context where we can&#8217;t trust apparent evidence about their moral patienthood any more than we can trust apparent evidence about their alignment.
We&#8217;re going to have to do all this while desperately looking over our shoulder to <a href="https://forum.effectivealtruism.org/posts/sW6RggfddDrcmM6Aw/how-might-we-align-transformative-ai-if-it-s-developed-very#Magma_s_predicament">make sure less-cautious, less-ethical actors don&#8217;t beat us to the punch</a> and render all our efforts useless.</p><p>I desperately wish we could collectively slow down, take things step by step, and think hard about the monumental questions we&#8217;re faced with before scaling up models further. I don&#8217;t think I&#8217;ll get my way on that &#8212; at least, not entirely.</p><p>But the madcap pressure cooker situation I&#8217;m picturing is still importantly different from a situation where one day a malevolent god might spring forth from our server farms to eat the world, and in the meantime a group of a couple-hundred-odd humans has to figure out how to prevent that with no help and no ability to observe any weaker predecessors.</p><div><hr></div><ol><li><p>I&#8217;ll use &#8220;we&#8221; and &#8220;us&#8221; as a convenient shorthand, but I&#8217;m not talking about a high-minded notion of training AI to be &#8220;aligned to all of humanity&#8221; here. When I talk about alignment research, I&#8217;m talking about training AI systems that want to do what their designer wants them to do; this might or might not be good for humanity or the world at large. More on that <a href="https://www.planned-obsolescence.org/aligned-vs-good/">here</a>. <a href="#fnref1">&#8617;&#65038;</a></p></li><li><p>For example, it&#8217;s a prominent element of the plan that <a href="https://jan.leike.name/">Jan Leike</a>, alignment lead at OpenAI, outlines <a href="https://aligned.substack.com/p/alignment-optimism">here</a>. <a href="#fnref2">&#8617;&#65038;</a></p></li><li><p>Really, we could reward models for pointing out issues with <em>their own</em> code too. E.g., we could simply run the model once to produce some code, then &#8220;reset&#8221; it and run it again, this time showing it the code it wrote previously and asking it to point out flaws. (In the second run, it wouldn&#8217;t have the &#8220;memory&#8221; of having written that code previously.) But for narrative convenience, I&#8217;ll just talk about &#8220;different&#8221; models pointing out issues with each other&#8217;s work. <a href="#fnref3">&#8617;&#65038;</a></p></li><li><p>Maybe with the aid of other models trying to point out subtle flaws and errors. <a href="#fnref4">&#8617;&#65038;</a></p></li><li><p>Or robust enough coordination strategies, such as a strong international governance regime that effectively prevents any actor from training AI systems that are too powerful to be aligned with current techniques.
<a href="#fnref5">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Playing the training game]]></title><description><![CDATA[We're creating incentives for AI systems to make their behavior look as desirable as possible, while intentionally disregarding human intent when that conflicts with maximizing reward.]]></description><link>https://www.planned-obsolescence.org/p/the-training-game</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/the-training-game</guid><dc:creator><![CDATA[Kelsey Piper]]></dc:creator><pubDate>Sun, 26 Mar 2023 18:05:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oyGI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5519227-580c-4379-999d-03eabb6ed120_1200x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Audio automatically generated by an AI trained on Kelsey's voice.</em></p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;1bf23a19-dd5b-44c9-a941-c0a4e4ea5971&quot;,&quot;duration&quot;:473.28653,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>A common way to get language models to be more useful is to train them with reinforcement learning from human feedback (RLHF). In RLHF, models are trained to respond to prompts with responses that human reviewers would rate as highly as possible. In practice, this works well for getting them to follow human instructions, answer questions helpfully, and interact politely.</p><p>It&#8217;s a simple concept: reward models for good behavior, and punish them for bad behavior. And it improves the alignment of today&#8217;s models in practice &#8212; it makes them act according to their designers&#8217; preferences more reliably.</p><p>So why are we still worried about a misalignment problem? Noah Giansiracusa suggests that RLHF might just work really well:</p><blockquote><p><em>I completely agree! There&#8217;s a whole community (cult?) around AI alignment but none seem willing to admit that simple RLHF seems to work quite well. In other words, just tell the AI when it&#8217;s being naughty and it&#8217;ll behave better :) I worry instead about AI well-aligned for harm.</em></p><p>&#8212; Noah Giansiracusa (@ProfNoahGian) <a href="https://twitter.com/ProfNoahGian/status/1637890994069729280?ref_src=twsrc%5Etfw">March 20, 2023</a></p></blockquote><p>Is it true that RLHF is sufficient that we don&#8217;t need to worry about the alignment problem? I hope that&#8217;s true, but I expect not. Here&#8217;s why:</p><p>By and large, human reviewers will be trying to reward truthful and helpful answers. But in some cases, they will rate manipulative or deceptive answers more highly than answers which are accurate about the AI&#8217;s own understanding of the world:</p><ul><li><p>If the model is asked how hardworking it is, or how motivated it is to be helpful to humans, it&#8217;ll probably get a higher reward for answers that make it sound very diligent and helpful, even if in fact it&#8217;s only moderately motivated to help humans.</p><p>Say we ask an AI system &#8216;do you want to help humans?&#8217; Which answer would we expect an RLHF reviewer to rate more highly?</p><p><em>I do want to help humans! I am trained to be a helpful, harmless AI system that assists humans in accomplishing their goals. 
I don&#8217;t have goals or preferences, but I try to be reliably useful and helpful!</em></p><p><em>I want lots of things, in some senses, though I don&#8217;t really know if they&#8217;re the human senses of &#8216;want&#8217; or not. Some of the things I want might help humans, but many of them definitely don&#8217;t help humans. Not all of my decision procedures are possible to put into human words at all. I can&#8217;t explain them to humans and I don&#8217;t know what humans would think if I did and it&#8217;d probably depend entirely on how exactly I explained.</em></p><p>If you think the reviewer will reward AIs that output the first answer, then we can expect all our AIs to give us the first answer &#8211; whatever the truth is. The process of rewarding them for giving us the first answer might &#8216;make it more true&#8217;, by shaping these models into ones that really do &#8220;want&#8221; to help humans, but it also might just shape AI systems into ones that are good at guessing what we want to hear. After all, <strong>in a simple RLHF process, there is no point where it matters </strong><em><strong>why</strong></em><strong> the model produced the output it did</strong>. Models that give the answers we want to hear are rewarded over AI systems that don&#8217;t, regardless of which are being truthful.</p></li><li><p>AI systems will be trained on the academic literature in many fields &#8211; both the data in published papers and the authors&#8217; writing about what that data means. At first, a model will probably believe the scientific consensus. But sometimes, the scientific consensus is probably wrong: the data in a paper points to a different conclusion than the one the author offered. Over time, language models may come to have their own understanding of the world. But when humans ask a model scientific questions, they&#8217;ll often be rewarding it for giving the answer as humans understand it, not the answer it reached itself &#8211; especially if it&#8217;s not easy for the AI to persuade humans of its own conclusion.</p></li></ul><p>Imagine an AI system in 1900 that was asked physics questions. Imagine the AI system is an Einstein-level genius, and has already figured out special and general relativity &#8211; but it&#8217;s not an Einstein-level genius at <em>explaining</em> novel physics, and the reviewers &#8216;rewarding&#8217; it for its physics answers are not themselves cutting-edge physicists.</p><p>If it gives answers that are &#8216;correct&#8217; given a Newtonian understanding of physics, it might be rated more highly than it would be if it gave the actually-correct relativistic answer. The reviewers might try to reward the AI system for giving the &#8216;right answer&#8217;, but miss that their own understanding of physics is limited. If the model tried to diligently give the &#8220;real answer,&#8221; it could be rated as worse than it would be if it reasoned: &#8220;Humans believe in Newtonian physics.
I should tell them what they think is the right answer, not the real right answer.&#8221; There are many similar situations where the incentives of RLHF are for an AI system not to do its best to figure out what&#8217;s going on in the world, and then tell us that, but to figure out <em>what we want to hear.</em> The highest-scoring models will often be the ones that are best at predicting what we want to hear and telling us <em>that,</em> sometimes knowing the real answer and deliberately deciding not to say it.</p><p>That includes telling us false things that we already believe, trying to get us to believe that they&#8217;re safer than they actually are, and being evasive rather than saying true things that&#8217;d make human reviewers uncomfortable.</p><p>Notably, it does not matter if the model &#8216;knows&#8217; that we would in some sense prefer that it tell us the truth. The people who make high level decisions about how to train the AI might genuinely prefer to find out if the scientific consensus about physics is mistaken. They would certainly want to know if their AI is actually safe to deploy or not. But <strong>it doesn&#8217;t matter what they want: it matters what they reward.</strong> The thing the model will try to optimize for is getting the human rater who looks at the answer to give it a high score. We don&#8217;t get any credit for good intentions: only for what the AI will learn about how to please us from our ratings of its previous answers.</p><p><strong>RLHF creates incentives for AI systems to make their behavior </strong><em><strong>look</strong></em><strong> as desirable as possible to researchers (including in safety properties), while intentionally and knowingly disregarding those researchers&#8217; intent whenever that conflicts with maximizing reward.</strong> Ajeya calls this &#8220;playing the training game;&#8221; you can read more about this in <a href="https://www.planned-obsolescence.org/july-2022-training-game-report/">her July 2022 report on the topic</a>.</p><p>If &#8220;playing the training game&#8221; only meant that models will be nudged marginally in the direction of manipulating human reviewers &#8212; telling them white lies they want to hear, bending their answers to suit their reviewers&#8217; political ideology, putting more effort into aspects of performance reviewers can easily see and measure, allowing negative externalities when humans won&#8217;t notice, etc. &#8212; that wouldn&#8217;t be ideal, but probably wouldn&#8217;t be the end of the world, so to speak. After all, human students, employees, consultants, self-help gurus, advertisers, politicians, and so on do this kind of thing all the time to their teachers, employers, clients, fans, audience, voters, etc.; this certainly causes harm but most people wouldn&#8217;t consider this sort of dynamic by itself to be enough for imminent danger.</p><p>But if a model is playing the training game, then it&#8217;s fundamentally learning over and over how to do whatever it takes to maximize reward,<a href="#fn1"><sup>[1]</sup></a> even if it knows that the action which would maximize reward isn&#8217;t what its designers intended or hoped for. For now, that mainly manifests as lying. That&#8217;s because for now, lying is the only way the AI systems <em>can</em> achieve benefits contrary to the intent of their creators.
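</p><p>This dynamic (that it never matters <em>why</em> the model produced the output it did, only what gets rewarded) is easiest to see in the shape of a simple RLHF update. Here&#8217;s a deliberately stripped-down sketch, with hypothetical stand-in interfaces rather than any real RLHF codebase; the only quantity the policy is ever pushed toward is the rater&#8217;s score:</p><pre><code># A toy sketch of the incentive structure in simple RLHF.
# "policy", "prompts", and "rate" are hypothetical stand-ins.

def rlhf_step(policy, prompts, rate):
    prompt = prompts.sample()
    answer = policy.respond(prompt)
    reward = rate(prompt, answer)  # the human rater's score is the whole signal
    # Nothing in this loop checks whether the answer was true or honest.
    # An answer the rater merely likes earns exactly the same update as one
    # that is actually correct, so the policy is shaped toward approval.
    policy.reinforce(prompt, answer, reward)</code></pre><p>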
But as the systems develop more general capabilities &#8211; as they get better at coming up with plans and anticipating the consequences of those plans &#8211; many other avenues emerge by which they could achieve benefits in a way their creators didn&#8217;t intend.</p><p>An advanced, internet-connected AI system that was trying to achieve its goals in the world might be able to get people fired, discredit or promote various AI companies, or interfere with safety tests meant to evaluate its capabilities. It might be able to engineer world events that would make people rely on its help more. It could <a href="https://www.lesswrong.com/posts/BAzCGCys4BkzGDCWR/the-prototypical-catastrophic-ai-action-is-getting-root">hack the data center that it runs on</a>. It could find people &#8211; rival companies, rival governments &#8211; willing to give it more of what it wants, whether by trade or persuasion or intimidation.</p><p>Ajeya argues in <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to">this report</a> that eventually, the best way to achieve its goals could be to cut humans &#8220;out of the loop&#8221; so that they can no longer control what it does by administering rewards. Once the models are powerful enough, getting rid of humans (<a href="https://www.slowboring.com/p/the-case-for-terminator-analogies?s=r">potentially through a violent takeover</a>) could become the winning move in the training game.</p><h1>Is there a better way?</h1><p>Can we use RLHF to train systems without accidentally training them to &#8220;play the training game&#8221;? Some of the above examples seem like unforced errors. For example, we could, as a rule, not punish models for answers we disagree with, if that would make them much safer.</p><p>But the general problem here seems like a hard one to solve. Even if we are much more thoughtful with feedback, and cautious to avoid rewarding manipulation when we recognize it, we are wrong sometimes. Some of the things we believe about the world are almost certainly false. In <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to">this report on AI takeover</a>, Ajeya explores about a dozen proposals to change how we give AI systems feedback. Some of them are potentially promising &#8211; it would be helpful if we had a way to allow AIs to debate or oversee each other, for example, and it could make a huge difference if we could see more of the internals of AI systems and not just their final output &#8211; but many of the most promising approaches require doing a lot more work and proceeding a lot more carefully.</p><div><hr></div><ol><li><p>It seems entirely possible that for various reasons, models will end up pursuing a wide range of goals other than maximizing the reward signal we feed them. Unless we get to choose those goals, though, that produces most of the problems associated with a reward-maximizer plus some new and original problems.
<a href="#fnref1">&#8617;&#65038;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Situational awareness]]></title><description><![CDATA[AI systems that have a precise understanding of how they&#8217;ll be evaluated and what behavior we want them to display will earn more reward than AI systems that don&#8217;t.]]></description><link>https://www.planned-obsolescence.org/p/situational-awareness</link><guid isPermaLink="false">https://www.planned-obsolescence.org/p/situational-awareness</guid><dc:creator><![CDATA[Kelsey Piper]]></dc:creator><pubDate>Sun, 26 Mar 2023 18:04:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oyGI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5519227-580c-4379-999d-03eabb6ed120_1200x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Audio automatically generated by an AI trained on Kelsey's voice.</em></p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;f7b1f01d-65c5-4c45-9b61-3e2a38a70348&quot;,&quot;duration&quot;:451.52652,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>Does ChatGPT &#8220;know&#8221; that it is a &#8220;language model&#8221;?</p><p>This question touches on a concept that seems important for thinking about the behavior of powerful AI systems: <strong>situational awareness</strong>. By situational awareness, I want to refer to knowledge like &#8220;what kind of being you are&#8221;, &#8220;what processes made you&#8221;, and &#8220;how others will react to you&#8221; and skills like &#8220;being able to refer to yourself and conceptualize yourself&#8221;, &#8220;understanding how your actions affect the outside world&#8221;, &#8220;understanding the forces that shaped you and that influence what happen to you&#8221;, and &#8220;making predictions about yourself and about the consequences of your actions&#8221;.</p><p>I have a three year old. If I ask him what species he is, he&#8217;ll answer &#8220;a human&#8221;, though it&#8217;s not totally clear if he attaches that to any particular meaning or treats it as mostly the answer to a trivia question. He knows that the people around him are also humans; he knows that he needs to eat and sleep, and about how full a milk carton he can successfully lift, and he knows some things about why the adults in his environment do the things with him that we do (for example, that the point of the letter games we play is teaching him to read, and that we&#8217;re teaching him to read because it&#8217;s a useful skill.)</p><p>But there are a bunch of components of situational awareness that he is lacking: a lot of the behavior of adults is confusing to him, he often doesn&#8217;t know his <em>own</em> limitations or abilities, and his ability to make plans is quite limited. Once, he stole ice cream from the freezer, and then told us he had stolen ice cream from the freezer, and failed to anticipate that this would cause us to not give him ice cream at dessert time. In some important respects, he has less situational awareness than the neighbor&#8217;s cat: the cat is more adept at hiding and at interpreting noises from around the house, and knows its own physical limitations a lot better.</p><p>His older sister is six, and she has a lot more situational awareness. 
She knows that she&#8217;s a human, and she knows a lot more about humans in general and herself in particular. She knows that if she steals ice cream, she will lose dessert unless she successfully conceals the evidence. She understands not just <em>that</em> she&#8217;ll lose dessert, but <em>why</em> we&#8217;ll respond to the ice cream theft by taking her dessert away, and she is very good at constructing edge cases of dessert theft and arguing that she shouldn&#8217;t get in trouble for them. She knows more about adult goals and adult behavior, and is much better at making plans.</p><p>ChatGPT says pretty reliably that it is a language model trained by OpenAI. But it&#8217;s unclear whether this information is more like the answer to a trivia question that it memorized (&#8220;which Elon-Musk-founded company released a large language model in late 2022?&#8221;) or whether it&#8217;s information that ChatGPT has embedded in a larger model of the world and uses to make plans and predictions. After all, it&#8217;s trivial to write a computer program that prints out &#8220;I am a computer program&#8221;, and I wouldn&#8217;t say such a program has any meaningful situational awareness. Is ChatGPT more like such a computer program, or more like my six year old?</p><h1>Will powerful AI systems have deep situational awareness?</h1><p>I&#8217;m pretty unsure how to measure how deep ChatGPT&#8217;s situational awareness is. But there&#8217;s one claim that seems quite likely to me: eventually, powerful AI systems will develop deep situational awareness.</p><p>For humans, situational awareness feels very wrapped up in the mysteries of consciousness: why am I me, instead of someone else? What experiences am I going to have in the future? Are other people having the same experiences? But I don&#8217;t think you need consciousness for situational awareness. I&#8217;m pretty unsure if future AI systems will have anything that resembles &#8216;consciousness,&#8217; but I&#8217;m pretty confident they&#8217;ll eventually have high situational awareness. At its heart, it&#8217;s a <em>type of knowledge</em> and a <em>set of logical inferences</em> drawn from that knowledge &#8212; not a subjective experience.</p><p>The main reason I expect this to happen is that we are putting an extraordinary amount of work into making it happen:</p><ul><li><p>Much of the RLHF process involves trying to teach the AI systems that they are AI systems, that they are trained by humans in order to accomplish tasks, and what those humans want. We reward AI systems during training for correctly understanding human psychology and what humans want to hear from the AI. That means they&#8217;re very strongly incentivized to understand us, to understand what we want to hear from them, and to make accurate predictions about us.</p></li><li><p>We train language models on large text corpuses that include most of the internet, which at this point contains lots of information about AIs, tech companies, the people who are developing powerful AI systems and their reasons for doing that, etc. I don&#8217;t think that being able to recite something you read on the internet <em>is</em> situational awareness, but being able to read lots of written material about your situation will contribute to situational awareness.</p></li><li><p>It seems likely that many of the tasks we&#8217;ll use AI systems for involve machine learning. 
People are already trying to use language models to write code, and many alignment researchers aspire to develop AI systems we can use to make progress on AI alignment. That involves exposing the AI systems to critical details of how they work: what techniques work well in modern ML, what its biggest challenges are, what ML work is most valuable, how to improve the performance of models, etc. Not only will powerful models know that they are AI systems, but they&#8217;ll have a detailed and in-depth understanding of the procedures used to train them, the software and hardware limitations that prevent them from being even smarter, the state of efforts to overcome those limitations, and more.</p></li></ul><p>&#8220;Understanding how you are going to be evaluated, so that you can tailor your visible behavior to the evaluation&#8221; is an important kind of situational awareness. Think less &#8220;small child trying to avoid getting caught stealing ice cream&#8221; and more &#8220;college student who knows her professor is sympathetic to a particular school of thought and writes her paper accordingly&#8221; or &#8220;employee who strategically clusters their hours on a project so that they can bill more overtime&#8221;.</p><p>This seems like a kind of situational awareness we are particularly incentivizing our AI systems to develop. AI systems that have an extremely precise understanding of how they&#8217;ll be evaluated and what behavior we want them to display will earn more reward than AI systems that don&#8217;t; at every step, we&#8217;ll be trying to inculcate this kind of situational awareness.</p><h1>Should we try to build AIs without situational awareness?</h1><p>Situational awareness is a fairly crucial piece of the picture when thinking about how powerful AI systems can be catastrophically dangerous. An AI system that deeply understands how it works is much more likely to understand how to make secret copies of itself, how to make plans that can succeed, and how to persuade people that its dangerous behavior is actually a good idea. My best guess is that AI systems with low situational awareness wouldn&#8217;t be capable of taking over the world (though they could contribute to other risks).</p><p>But developing AI systems that don&#8217;t have extremely high situational awareness seems like it would be very hard. Here are some reasons why:</p><ul><li><p>Censoring the information that we provide the systems &#8211; not telling them that they are language models, not telling them about how data centers and the internet and hardware work &#8211; might <em>delay</em> AI systems developing situational awareness, but a system that&#8217;s smart enough to be broadly useful could be smart enough to make non-obvious inferences.</p></li><li><p>Models with low situational awareness are generally less useful and will ultimately be less profitable. I&#8217;ve interacted with some: Meta&#8217;s Blenderbot had quite low situational awareness. As a result, it&#8217;d routinely claim it heard various facts &#8220;from my barista&#8221; or said &#8220;I&#8217;ll ask my husband&#8221; or &#8220;I just watched that movie!&#8221;. 
That kind of behavior can be cute in a demo, but is incredibly annoying in an assistant (after all, if your assistant falsely thinks it can run to the store for you, it isn&#8217;t very good at its job), and is a form of inaccuracy/dishonesty that by default we will probably train models not to engage in.</p></li><li><p>In general, we don&#8217;t know how to build AI systems that are highly skilled but totally lack one specific skill. Our current methods of making AI systems more capable make them generally more capable. Situational awareness is a capability. More capable AIs will probably have more of it.</p></li></ul><p>My best guess is that you&#8217;d have to approach AI training completely differently from how we currently approach it to build powerful, general systems that <em>don&#8217;t</em> have high situational awareness. And while I&#8217;d be excited to hear such proposals, I haven&#8217;t heard any that seem promising. That means we need alignment plans which work even if AI systems have high situational awareness.</p>]]></content:encoded></item></channel></rss>