Automation's Asymptote: Part 2

Yesterday's piece ended on a question I could not quite resolve: it can be simultaneously true that humans remain in the loop and that the economics of being in the loop deteriorate. Shipper's frame-and-framer argument gave me a philosophical reason to doubt the automation doom, but philosophy can be a cold comfort if your paycheck is tied to a frame the models are about to climb. You can go to Mercor and see for yourself which skillsets are currently in the process of being RL-ed away.

Source: Mercor

How such automation affects different industries and companies is going to be a key question to ponder going forward. As I said yesterday, trillion dollars are being deployed on an annual basis to automate a good chunk of current knowledge work. Tom Reed wrote a very good piece last month arguing that we may be pursuing what he calls “Goodhart Singularity”. Reed’s counter to automation doom is disarmingly simple: you cannot get good at solving problems without access to a source of problems, and the only source of most problems is slow, expensive interaction with the real world. Without that contact, the recursive loop produces something far less impressive than advertised. From Reed’s piece:

“The output of the R&D produced by an isolated datacenter of geniuses would be a mere Goodhart Singularity.4 An isolated AI improving itself against benchmarks would only appear to be approaching superintelligence, while actually optimising for eval performance that fails to generalise beyond the lab.”

Why would self-improvement stall outside the lab? Because models get good at what they practice, and for most economically valuable work, there is nothing to practice on. Reed’s most clarifying observation is about what kind of data exists at all:

“For most tasks in the economy, the pretraining corpus contains writing about the task, but not a record of the task itself. This is of course one of many reasons coding has progressed faster than other domains - code is one of the neat cases for which the task itself is almost entirely reducible to its token trace.”

The internet contains commentary and advice in abundance, but the actual steps of closing an M&A deal or deciding which drone prototype to ship were never serialized into tokens. The natural rebuttal is that a sufficiently smart system can simulate whatever data it lacks. Reed is skeptical that simulation is a viable path:

“Consider that almost half of SWE-Bench submissions accepted by AI auto-graders would be rejected by the actual human maintainers of the relevant repositories. The fact that you can pump SWE-bench scores without increasing actual merge rates is, to me, suggestive of the situation the datacenter-genius will find itself in.

The great Zhengdong makes this point about the progress of AI research itself. Not only are “evals” the only things that models are capable of getting good at, but “the researchers [themselves], they just wanna optimise… they just want an important problem to solve, a clear evaluation that measures progress towards it, and then they just wanna optimise it.” I suggest that AI companies need real-world deployment as a source of problems, or else they will have no good targets for optimisation.”

The signal that something is good is generated by millions of market actors revealing their preferences through behavior, and it does not exist anywhere before deployment generates it. If Reed is right, diffusion itself is an input to capability. The benchmark frame is climbable precisely because it is frozen, but does or can the economy ever stop generating new “frames”?

The standard objection here is “sample efficiency”: sure, the data does not exist today, but what if the models become more efficient learners, and the automated researchers will crack the learning problem itself. Dwarkesh recently wrote a piece in which he mentioned sample efficiency hasn’t been a key source for model improvement:

“One definition of intelligence is sample efficiency - that is to say, how much data do you need to see in a given domain in order to operate fluently and competently. It’s not clear that we’ve actually made much progress on training sample efficiency over the last few years - it seems like more so we’ve dramatically widened and improved the data distribution.

The main way that AIs have been getting better is from adding more and better data, and scaling the compute to develop that data in the first place. Obviously RL is the main way that has happened. You can think of RL as a kind of synthetic data generation - you dump a lot of compute against a verifier in order to find the “good” data. Then you train your model to predict these correct rollouts, much in the same way that you might train it to predict the next word in internet text.”

Once you see RL as synthetic data generation against a verifier, you can understand the model needs enormous quantities of human expert demonstrations in every domain you want competence in. Dwarkesh does, however, also argue that sample inefficiency may simply not matter for a huge swath of white-collar work. What the model learns amortizes across billions of sessions, so you can be ludicrously inefficient in training and still make enormous economic sense to pursue it as the labs’ revenue curves demonstrate.

If you put Reed and Dwarkesh side by side, there is a sense of a circularity: the researchers in the datacenter would need a real, diverse problem set just to evaluate whether a candidate learning algorithm is better, but that problem set is precisely what the datacenter lacks in many, many domains.

So if capability runs through deployment, and the manufactured substitute for real-world signal is staggeringly expensive and bespoke, where does the value accrue? Sarah Guo wrote an interesting essay on that question and she makes the case that anything you can measure and you can train against, that is already on its way to a commodity. Her 2x2 maps work along two axes: whether the answer is publicly verifiable or lives in private data, and whether the task is still at the frontier or already saturated. From Guo’s piece:

“we may ask two things of any kind of work. Is its correctness private and expensive to establish, the kind of truth that exists only inside someone’s data? And is it walled off, locked inside a system you can’t get into? Set those against how saturated the task is, and you get a 2x2. Saturated work with public answers is commodity tokens, and open models own it. Frontier work with public answers, where coding benchmarks live, is where the labs win, because when the eval is free, owning it counts for nothing. The prize is the last corner, the untrainable one: frontier work whose correctness exists only in private. You can see it in the inference clouds hosting the AI-native pioneers, where the vast majority of tokens are generated by custom models, not generic open ones.

…Capability eats many things, but a better model does not make private ground truth public. It does not hold the license, sign off on the liability, or own the firm’s files, and it cannot be the party that gets sued when the answer is wrong. Intelligence is not the bottleneck here. Permission is, and so is accountability. You can imagine a model far smarter than any person, and it still has to be let in the door, and someone still has to put their name on what it does.”

Coding matured first because the compiler and the test suite are free verifiers, yet even there Guo cited researchers at MIT’s work spanning over 100,000 developers: coding agents lifted code written by roughly 180% while code actually shipped rose only about 30%. The gap between those numbers is the illegible part of the job that is still hard to automate away.

The big prize for non-AI labs is the frontier work whose correctness exists only inside a firm's private data. This 2x2 framework is useful way to think through the risk labs may pose to any software company. For the more visually inclined, Latent Space captured this 2x2 framework in the below infographic:

Source: Latent Space

Subscribers get the daily journal and five+ years of Deep Dives, i.e. full-length analyses with financial models on 65+ companies. The daily is just how I think out loud between the Deep Dives!


Current Portfolio:

Please note that these are NOT my recommendation to buy/sell these securities, but just disclosure from my end so that you can assess potential biases that I may have because of my own personal portfolio holdings. Always consider my write-up my personal investing journal and never forget my objectives, risk tolerance, and constraints may have no resemblance to yours.

My current portfolio is disclosed below:

This post is for paying subscribers only

Already have an account? Sign in.

Subscribe to MBI Deep Dives

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.
jamie@example.com
Subscribe