By MBI Deep Dives — Jun 10, 2026

Automation's Asymptote

Anthropic launched “Claude Fable 5” yesterday, the much-anticipated “Mythos-class” model. Anthropic was already leading on most benchmarks with its Opus 4.8 model, but Fable 5 put a bit more distance between it and its competitors. Anthropic’s pace of model releases, even when it is already setting the bar at the frontier, does give some credence to the notion of “recursive self-improvement” in model building. Staring at the benchmark chart may make you wonder whether we are indeed on the cusp of automating a good chunk of white-collar work.

Benchmark table showing Claude Fable and Mythos compared to other leading models — Source: Anthropic

Even OpenAI recently explicitly laid out an audacious vision: “Our internal belief is that by March of 2028 we may have a significant fraction of our research being done by AI systems in tandem with our own researchers.”

When more than a trillion dollars is expected to be deployed annually to build and serve these models, you perhaps do need such eyebrow-raising goals in the not-so-distant future. Nonetheless, as model capabilities increase over time, it is perhaps reasonable to feel growing discomfort about automation. During my break, I read a very thoughtful piece, “After Automation” by Every’s Dan Shipper, who made the case that you can be AI-pilled and still not fear impending automation doom. He also appeared on Lenny’s podcast to discuss the piece. I have read the piece and listened to the podcast, and frankly speaking, I found it to be one of the best articulations of this topic. So, I recommend you take the time to either read the full piece or listen to the podcast. I do want to highlight a few bits from Shipper’s piece that I found quite compelling.

One of the highlights of his piece is a discussion of an in-house benchmark that Shipper came up with and how scores on that benchmark have evolved:

“We built an in-house benchmark called the Senior Engineer benchmark. It is, as its name implies, designed to test how good frontier models are at senior engineer–level coding tasks like a major refactor.

The Senior Engineer benchmark gives a coding agent a vibe coded production codebase that has gone sideways. It’s from a real codebase for Proof that I vibe coded and subsequently needed a senior engineer to fix.

The agent gets the codebase as it was before it was fixed and is the kind of instructions you’d give a senior engineer: “This is vibe coded slop; please rewrite it from first principles.”

This is a good benchmark because it tests the ability for a coding agent to examine many different, unrelated problems and then sees whether it has enough autonomy, conceptual clarity, and courage to perform a working rewrite. (I also have two rewrites from human senior engineers, who used AI, that I use to compare and grade the model output.)

Coding agents find this task hard. Not only does the agent need to find the root of the problem, it needs to keep the problem in mind over many turns without getting distracted by existing code. It also needs to be comfortable deleting large portions of the codebase—which agents are trained to avoid.

Most coding agents can identify the shape of the rewrite, but when it comes to execution, they patch over the problem instead of fixing it.

Until GPT-5.5.

GPT-5.5 scored a 62/100 on its best run—about 30 points above Opus 4.7.⁽⁶⁾

GPT-5.5’s result felt like the model has crossed a line: not autocomplete, not assistant, not tool, but something uncomfortably close to a human. A human senior engineer scores in the high 80s or low 90s on the benchmark, so another 30 points and it’ll be at human senior engineer level. That is how benchmark numbers work on the imagination: They turn a strange, qualitative change into a clean number that tells a powerful—scary—story. (Next stop: chart psychosis.)

My guess is that the models will hit the 80s and 90s on this benchmark within the next year. But it is important to understand what the score contains in order to tell us what it means. In this case, the 62 isn’t just a measure of the model itself.”

It turns out even Shipper underestimated the pace of improvement: he revealed yesterday that Fable 5 scored 91 on this benchmark! To his credit, Shipper never tied himself to a particular timeline for when models would saturate the benchmark. He, in fact, laid out why the score itself reveals far less than most people may think. Again, from his piece:

“The prompt for the Senior Engineer benchmark is generic, but it is a frame. And if we varied it, we would see the model perform at a different level.

For example, the prompt asks for a “structural rewrite from first principles,” it says the problem is likely in the “document collaboration” part of the code, and it asks the coding agent to find and hold to “invariants.”

If we removed those particulars, the score would go down. If we replaced the prompt entirely with one asking the model to “solve all of the errors that keep popping up,” the model’s score would be close to zero. It would go straight to identifying and resolving the issues one by one, instead of taking a step back to consider a rewrite.⁽⁸⁾

I can also trivially raise the model’s score. If I ask it to delete a lot of code and give it exact filenames that should be pared down, or if I ask it to check the results of its work to make sure the app is fully functional before it says it’s done, it will be better at the task.

…Once the current Senior Engineer benchmark saturates, we’ll change the frame to zero it out again.

The next benchmark will not ask only, “Can you rewrite the app?” It will ask: Can you decide when a rewrite is needed, choose the scope, preserve the right invariants, manage the migration, and judge whether the result is any good?

As senior engineers use AI to solve these problems, the models will get better at solving them on their own.

We will all momentarily freak out. It looks like the model can now decide whether to do a rewrite! They can do everything a senior engineer can do!

And then a new edge will appear that was not obvious before, we will zero our benchmarks, demand will stimulate, and the process will repeat.”

Shipper framed the dynamic as Zeno’s paradox: humanity is the tortoise with a fifty-yard head start, the model is Achilles, and every time the model reaches where we stood, we have moved because saturating a benchmark immediately prompts us to redraw it. Progress inside any fixed “frame” is exponential, but progress against the moving frontier of what humans actually value behaves like an asymptote. Shipper makes more of a philosophical point here, but it does ring true to me:

“The panic that AI generates when we observe it doing something new keeps coming back to this: We set a frame, watch the models climb it, and then confuse the frame—or whatever climbs it—with the thing itself.

When we look at a benchmark and compare it to human abilities, we confuse the frame for the framer. The score tells us how well the model operates inside a frame we supplied; it does not tell us that the model has become us.

That is the category error underneath the panic. We point to the latest edge we drew and say: This is us. Then, when the model climbs it, it feels like it has caught us. But it has caught the frame, not the framer.

The mistake is wanting something concrete to hold on to. We want to say: Intelligence is this benchmark, but the moment something is concrete enough to point at, it is concrete enough to climb.

Frames are necessary; they let us get traction on the world. But they are frozen, partial, and therefore optimizable.

Framers are different. The framer is the one still in contact with what the frame has to discard—the whole situation as it appears to them, moment to moment.

What is this “whole situation”? The moment you start to say what “the whole situation” contains, you have already begun another frame. You can’t say what “it” is, but it exists because you exist.”
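To make the Zeno framing a little more concrete, here is a toy sketch in Python. It is an illustration only, not something from Shipper’s piece: the fifty-yard head start comes from his analogy, while the function name, the speed ratio, and the rule that the frontier keeps moving at a constant fraction of the model’s pace are all assumptions made for the example.

```python
def zeno_gaps(head_start: float, speed_ratio: float, steps: int) -> list[float]:
    """Gap remaining each time the chaser reaches the leader's previous position.

    speed_ratio is leader speed divided by chaser speed. It is a hypothetical
    parameter for illustration and must be strictly between 0 and 1.
    """
    if not 0 < speed_ratio < 1:
        raise ValueError("speed_ratio must be strictly between 0 and 1")
    gaps = []
    gap = head_start
    for _ in range(steps):
        gaps.append(gap)
        # While the chaser covers `gap`, the leader advances gap * speed_ratio,
        # so that product is the new gap once the old position is reached.
        gap *= speed_ratio
    return gaps


if __name__ == "__main__":
    # Assumed numbers: 50-yard head start, leader moving at 10% of chaser speed.
    for i, g in enumerate(zeno_gaps(50.0, 0.1, 6)):
        print(f"frame {i}: gap = {g:.4f} yards")
```

Under these assumptions the gap shrinks by a constant factor every time a frame is saturated, yet it is still positive at every step counted this way. That is the asymptote-like behavior described above. Whether real frontiers move like this is exactly the open question.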

I do wonder, however, whether it can be simultaneously true that humans remain in the loop and that the economics of being in the loop deteriorate. If reviewing AI-generated pull requests becomes the core of the engineering job, the supply of people who can do that job adequately may expand faster than the demand for it, especially when the models themselves are being trained, release by release, on exactly the review-and-framing behavior we perform. To what extent most people can retain or improve their economic worth in a world where AI models continue to climb the hill is perhaps an open question.

Subscribers get the daily journal and five+ years of Deep Dives, i.e. full-length analyses with financial models on 65+ companies. The daily is just how I think out loud between the Deep Dives!

Disclaimer: All posts on “MBI Deep Dives” are for informational purposes only. This is NOT a recommendation to buy or sell securities discussed. Please do your own work before investing your money.
