The Pokemon prediction seems harder to me than the other 80% predictions, though maybe that's just because I saw an early Claude Plays Pokemon and was surprised by how many basic things tripped it up. Something something "real world complexities are surprisingly tricky to address"? I think recent models do substantially better, but still get tripped up in silly ways. Maybe this gets solved by continuous progress in a few areas of weakness, though: integrated image processing + longer context windows + better "notes-to-self" writing.
Math feels like the cleanest task, in the sense that there are no surprising/spiky environmental features to process; it's "natively" a thing you can do via text stream.
VN design also feels intuitively easy to me; maybe the main need here is again just bigger context windows and better self-management. I should play with Claude Code and see what trips it up here!
How do you plan to handle the incoming era of amazing video games on tap? Plug your ears with wax, or try to harness it for The Good by generating gripping edutainment for yourself about AI progress?
Hmm, the Pokemon prediction is about a typical ten-year-old, and they're not *that* good at Pokemon (though honestly I don't know haha). And yeah, I think some chunk of this will probably get solved through better image processing plus generic anti-hallucination training that transfers?
Ah yeah, I sort of glossed over the "average 10yo" bit, fair enough.
On a purely intellectual level, I'd love more benchmarking of different humans on some of these things we're having AIs do now -- it would be very fun to see whether human kids get tripped up in the same places Claude does.
Although it is somewhat against the spirit of empirical laws whose effect comes from aggregating a large number of independent causes, I would like to ask people here to speculate: what sorts of human tasks have a >24h time horizon but cannot easily be delegated to a group of humans who each have 24h windows, plus notes from a manager (who can themselves be replicated from 24h agents)?
I assume such tasks must somehow involve learning and memory on task-specific subtasks of a form that cannot easily be learned or transferred from an instruction manual plus a short period of practice. For example, a physical task may require you to become proficient with a new type of machinery, where proficiency cannot be attained through short-timescale practice. But for cognitive tasks, it is harder for me to picture the type of task that requires experience-based learning. Perhaps a task with a subtask that involves learning a novel subject, a novel programming language, etc.
For your logistics prediction, do you have any qualifiers in terms of scaffolding or prompting? I imagine AI could probably pull this off already with enough specialized scaffolding and hand-holding in prompts, but that's not that impressive imho.
I doubt that they could do these tasks robustly / well with scaffolding right now, or else we would probably have already seen lots of examples. (I do mean to implicitly rule out very overfit scaffolds where the scaffold is doing a lot of the work and you get a very cookie-cutter birthday party that doesn't flexibly incorporate preferences and complications.)
Great post, ty!
Thanks!