Even if 50% Success is going super-exponential now, wouldn't you expect 80% Success to need to go exponential before full automation of AI R&D happens?
Wouldn't you need an even better than 80% success rate for full automation? As the time horizons get longer, it will take more and more time and effort for humans to review the output, so you can't hand off all the work to AI if it'll fail 20% of the time.
That was my thought, and the basis for an argument that a 10% chance of full automation of AI R&D by end of 2026 is too high. But it's hard to be too confident. I'm probably at 5-10%.
"Full automation" is rather arbitrary. I expect your task list is long enough that you can tell AI to do the next 5 things on the list, then it fails on one of them. You work on that task while telling it to do the next 5 things. So 80% success rate would be a de facto 5x speedup.
People seem to have a bias towards coming up with reasons to believe that AI progress will be slow and/or unimpactful. People aren't comfortable facing the possibility that it could be really dramatic and impactful. But if you want to predict reality accurately, you should spend ~equal amounts of time thinking of reasons why AI progress will be both fast and slow.
My median for 80% Success by end of 2026 was about 5 hours in January (and now may be more like 10 hours). Though my 90th percentile forecast is probably something like a couple of days / not measured properly, so maybe this isn't actually a reason to think your 10% forecast for full automation of AI R&D is too high. Unless 95+% Success needs to get into the >24 hour range for full automation of AI R&D to be reached, in which case I do think there is a less than 10% chance of this by end of 2026, which would make a 10% chance of full automation of AI R&D too high.
Why did you update your 80% so much when Opus 4.6 seems on trend?
Opus 4.6 was on trend for 80%, but above trend for 50%. My model for 80% is that it lags behind 50% by about 10-12 months. So just as 50% accelerated when it got to tasks that took ~5-12 hours, I expect 80% will accelerate when it gets to tasks of that length as well. In other words, I don't expect SOTA for 80% to go from 4 hours to 6 hours to 8 hours to 10 hours, but instead expect it to jump up faster in that range just as 50% did. So my median by end of 2026 is significantly higher thanks to the jump that happened with 50% 10-12 months before the end of 2026.
(1). Do you think that the 50% time horizon graph going super-exponential will reduce this lag time, or do you think that the 80% time horizon graph will just go super-exponential in roughly the same way but ~10-12 months later?
(2). In the AI futures model update, Daniel has the automated coder coming first. His automated coder has an 80% time horizon of 3 work years (6,240 hours). I am wondering when you think we would see an 80% time horizon of 3 work years (6,240 hours)?
(1) I'd guess the latter with my median forecast. Of course, with my tail forecasts it seems much more likely that the lag time will e.g. decrease to 3 months by end of 2026 (or shorter eventually) than it is that it will increase to 18 months. But my median forecast of ~10 hours by end of 2026 means my median forecast is that the lag time will remain about 10-12 months at the end of 2026.
(2) I don't know, I haven't thought of that question yet. Also, it's hard to ascertain which tasks/projects take humans 3 years, and I suspect the more relevant metric will be what Ajeya mentions in this post, namely AI reliability at tasks that take human *teams* long amounts of time. I also haven't looked at Daniel's update yet, but will check it out, thanks.
(By “I'd guess the latter with my median forecast,” do you mean a reduced lag time?)
You should read the Fred Brooks classic The Mythical Man-Month if you haven't yet!
Actually reading the whole thing might be overkill, but it includes an influential discussion of why it's hard to partition programming tasks across multiple programmers.
Agreed with everything written in the comments to you so far. Lots of useful feedback for you to iterate on!
1. We already measure 80%, so simply stop reporting 50% as it is clearly becoming saturated and therefore less useful.
2. 80% seems to lag 50% by nearly a full year, so that buys you a LOT of time for your colleagues to build better tasks.
3. Clearly AI agents need much higher success rates than 80%. Investigate 95% success rate measurements in the future for further savings as your tasks begin to become saturated at 80%.
Currently 80% Success lags behind 50% Success by about 10-12 months. I guess a key question is whether 50% going super-exponential will reduce this lag time, or whether 80% will just go super-exponential in roughly the same way but ~10-12 months later.
As models become capable of longer-horizon tasks, release cycles are shrinking, not expanding.
Doesn't this mean that labs are systematically under-testing the most advanced models on behaviors that may emerge in really long activities or over extended interactions?
We may be shipping models whose long-horizon character we've never actually seen.
What does this mean for AI safety?
METR time horizon doesn’t seem to transfer perfectly to performance on real-world tasks, which are often messier and harder to evaluate. Performance on the Remote Labor Index, made to measure capability on actual remote work, is just 4.17% for Opus 4.6.
Curious about your thoughts on this. E.g., could you divide 80% time horizon by a constant to approximate real-world time horizon? Or is the relation more complicated?
I have an interesting experience that runs slightly counter to this. I am not a SWE by trade, although I can do basic scripting. Opus 4.6 + Claude Code is VERY good. So far I've built two projects with it:
1. A simple threat intelligence project built on top of unique data and a well-documented API (but one that likely doesn't appear often in training data)
2. A more complex project around merging a bunch of disparate (but common) data sources together and finding unique insights on top of them
Project 2 has been MUCH easier (as in, most things are a one-shot), while for project 1 I have to be very involved in the details of what it is doing and how it is working, including correcting very basic mistakes that surprise me (for example, searching Russian-language dark web forums for "cybercrime" doesn't yield amazing results). But there are broader issues (the analytics pipeline has a LOT more bugs despite being far more conceptually simple).
This has led me to the conclusion that:
1. It seems like many SWE tasks are very similar (for example, building analytics on top of marketing/sales data is probably extraordinarily robustly demonstrated in training data).
2. AI SWE is very good for in-distribution tasks, but struggles and makes extraordinarily basic mistakes when working on projects that are out of distribution.
For reference, I've probably spent 5 hours bug fixing project 2, and 40-50 hours bug fixing project 1.
I'm not sure a "months-long programming *task*" is a coherent concept. A task and a project (I'm not sure this is the best word for it, but you used it at least once in your post) are different in ways that matter: I might even propose that a task is something that can be done without talking to another person or agent, a "build this thing to spec" scenario, whereas a project is something that inherently requires ongoing communication, the owner (or owners) discovering the spec through interaction with multiple parties.
Under this approach, is it plausible that AIs are saturating performance on tasks, but are still pretty bad at projects?
(I'm not sold on this being the best decomposition; I'm spitballing.)
https://substack.com/@richplum32/p-192276845
> This is why my colleague Tom proposed that the calendar time it takes a large team of humans to do a task might be a better proxy for “intrinsic difficulty” than the time it takes one human working alone.
This brings to mind a 2D METR chart in which models plot a course on n-person, t-horizon tasks, with confidence intervals expanding as rectangles towards the upper right.
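Something like this toy rendering, where each model gets a rectangle on a (single-person task length) x (team size) plane; all of the numbers below are invented purely for illustration:

```python
# Toy sketch of the proposed 2-D METR chart: x = how long the task takes one
# person, y = how many people it normally requires. Each rectangle marks the
# (hypothetical) region a model clears at 80% reliability.
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# (label, hours at 80%, team size) -- invented numbers, not measurements
models = [("older model", 1, 1), ("mid model", 4, 2), ("frontier model", 16, 4)]

fig, ax = plt.subplots()
for label, hours, people in models:
    ax.add_patch(Rectangle((0, 0), hours, people, fill=False, label=label))
ax.set_xlim(0, 20)
ax.set_ylim(0, 6)
ax.set_xlabel("task length for one person (hours)")
ax.set_ylabel("team size the task normally requires")
ax.legend()
plt.show()
```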
> Human work appears to get more and more decomposable the longer and longer it gets.
Might be just an artefact of *our* human time horizon caused by low conscientiousness & focus; which tasks lie in the envelope of a human who cares about one thing for a thousand years? See also the factored cognition debate.
And, finally, to everyone else curious about this, the word "horse" occurs 227 times in the Project Gutenberg version of Anna Karenina (https://www.gutenberg.org/files/1399/1399-h/1399-h.htm).
Why wouldn’t this logic of task decomposability always hold for task length increases? For a very long time models could output single lines of PyTorch, but that didn’t imply that they could therefore do the example 4-hour tasks. I wonder if top human CEOs / politicians are the rare examples of humans capable of multi-year / multi-decade execution, but in their interviews they talk a lot about pivoting.
“But at the end of the day, that dataset had 19 software engineering tasks estimated to take humans longer than 8 hours, and Opus 4.6 was able to solve 14 of them at least some of the time (and it reliably nailed four of them).”
It means Opus 4.6 has around 20% reliability on 8-hour tasks. In real life, you don’t have a ground truth, so how long would it take to verify the result? Then, in case the solution is defective, how long would it take to fix it? If the total length of time is longer than 8 hours, then you don’t have any real productivity gains.
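As a rough sketch of that trade-off (the verify and fix times below are pure assumptions, not measurements):

```python
# When does handing an 8-hour task to a ~20%-reliable AI actually save time?
# verify_hours and fix_hours are illustrative guesses, not data.
task_hours = 8.0      # time for the human to do the task alone
p_success = 0.20      # reliability on 8-hour tasks, per the quote above
verify_hours = 2.0    # assumed time to check the AI's output
fix_hours = 6.0       # assumed time to repair a defective solution

expected_human_hours = verify_hours + (1 - p_success) * fix_hours   # = 6.8
print(f"With AI: ~{expected_human_hours:.1f}h of human time vs {task_hours:.0f}h solo")
```

With these particular guesses the handoff is only a marginal win, and pushing fix_hours much past ~7.5 hours erases the gain entirely, which is the point above.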
I like the idea of looking at the calendar time it takes a large team to complete a task. But fundamentally, I'd say the problem is that as tasks get harder, it becomes more natural to think not in terms of “how long does this take?” but rather “what fraction of people can do this at all within a reasonable amount of time?” (E.g. only ~50 people a year get an IMO gold medal.)
So I am not sure what to infer from "AI can do tasks that human teams can do in a month" because if the task is something like "develop successful app", it seems that time is no longer the main bottleneck.
For example, to estimate automated AI R&D I would go to theoretical CS, where we have somewhat standardized quality levels -- I can imagine a graph whose y-axis is something like: B conference, A conference, A* conference, STOC/FOCS (one step corresponds to maybe 5x fewer people producing work at that level). I think current AI is almost able to produce B-conference-level papers on its own (based on recent AI proofs). Automated AI R&D seems to me to happen around the STOC/FOCS level. So maybe if one can plot the slope of how fast this improves, one could get some prediction for when it happens.
Or maybe it's better to plot "how much money AI can earn", again hoping for a straight line after taking the log, and put the price of "automated AI R&D" at 1 trillion dollars? I'm not sure; I just feel that thinking in terms of "percentile-AGI" instead of "t-AGI" has a bit more bearing on estimating the more distant future.
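For what it's worth, the mechanics of that extrapolation are simple; the earnings figures below are placeholders just to show the shape of the calculation, not real data:

```python
# Fit a line to log10(AI earnings) over time and solve for when it crosses
# the (arbitrary) $1T threshold attached to "automated AI R&D" above.
import numpy as np

years = np.array([2023.0, 2024.0, 2025.0])   # hypothetical observation points
earnings = np.array([1e9, 5e9, 2.5e10])      # hypothetical annual AI earnings ($)

slope, intercept = np.polyfit(years, np.log10(earnings), 1)
target = 1e12
year_hit = (np.log10(target) - intercept) / slope
print(f"Log-linear trend crosses $1T around {year_hit:.1f}")
```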
I also think that as time horizons go to months, the work will usually be decomposable into smaller subtasks plus a minimal scaffold, so that models with a lower time horizon can scale up -- except in cases where it requires some "leap of insight", where you should expect to see an extremely wide distribution of task-completion times in humans as well.
However, I do believe this is because we are hitting some limitations of human brains, not of all tasks (obviously). And in some other agent, with a better ability to synthesize information, we would expect to see many more tasks which are completable at month-to-year scales if mapped to humans -- but the raw ability of humans to retain information, update it, and learn isn't up to this. Open to being wrong on this, of course.
What if this is because the human 50% time horizon is between a few days and a month? Sure, mathematicians can work on something for years, but that is quite rare and does not happen 50% of the time. It is actually pretty hard to work on something for a month without good “scaffolding”, i.e. good plans or teamwork, which is really about breaking up the one project into sub-projects.
I suppose the hypothesis would be that good scaffolding that decomposes tasks enables an agent's/team's time horizon to increase, and that this is the only way humans can complete bigger projects reliably.
But a much more capable agent than humans would not need such scaffolding.
Everyone makes mistakes