
Richard Sutton Thinks We Misread The Bitter Lesson

The strangest detail in the LLM boom is that its patron saint keeps declining the role.

For the past few years, The Bitter Lesson has been treated like a founding text for the scaling era. The argument sounded tailor-made for giant models: methods that exploit more computation and more data beat methods that bake in human cleverness. Venture decks nodded along. Research labs nodded harder. If bigger models kept winning, Sutton had already explained why.

Then Sutton started clarifying what he meant, and the fit got awkward.

He does agree that large language models are impressive users of computation. He also thinks they are stuffed with human knowledge in a way that should make anyone invoking The Bitter Lesson a little less comfortable. In his words, they are “clearly a way of using massive computation,” but they are “also a way of putting in lots of human knowledge.” That second clause changes the whole story. If your system depends on swallowing the traces of human thought at internet scale, Sutton does not see that as the clean version of the lesson. He sees a compromise, maybe a productive one, but still a compromise.

That is more than a semantic fight. It cuts to the deepest assumption in the current AI stack: that scaling text prediction is the general path, and everything else is an extension. Sutton’s view points in a harsher direction. The durable route, he keeps arguing, is not a bigger archive of what people already said. It is learning from experience, through action, feedback, and correction, in an environment that pushes back.

The industry used his essay to justify the present. Sutton is using it to warn about the next turn.

The essay people wanted to hear

The original 2019 essay landed because it explained a pattern computer science had seen for decades and still resisted emotionally. In speech recognition, computer vision, chess, Go, and much else, the methods that kept winning were not the ones with the most human structure delicately inserted. They were the ones that could absorb more compute and more data and improve with search or learning. Human insight still mattered, but mainly in designing systems that could scale. The artisanal parts aged badly.

That message hit at exactly the right time. Deep learning had already embarrassed many carefully engineered pipelines. Transformer scaling laws made the story legible to executives. If loss kept dropping as models got larger, then the path forward looked almost offensively simple: train bigger systems, on more data, with more compute, and let the machinery sort itself out.

You can see why LLM builders loved this interpretation. It made huge spending look principled rather than merely aggressive. It also gave a historical sheen to what might otherwise have looked like brute force. Nobody wants to say, “We bought a mountain of GPUs and hoped the curve would continue.” It sounds much better to say, “History shows that general methods win.”

There is truth in that. LLMs are a general method in an important sense. Nobody hand-coded English grammar into GPT-4. Nobody wrote down a taxonomy of sarcasm, travel booking, JavaScript, and breakup texts and wired it in by hand. The model learned broad statistical structure from scale, and that is a real break from older symbolic systems.

But that is not the whole comparison Sutton cares about.

His essay was never a generic blessing for any approach that gets larger. It was an argument about which ingredients remain productive as scale increases. Search and learning over raw experience keep paying off. Human priors, handcrafted decompositions, and domain-specific shortcuts often look smart early and then become liabilities when a more general method can exploit vastly more computation.

Once you read the essay that way, the current LLM regime looks less like the final form of the idea and more like an in-between stage.

LLMs run on distilled civilization

The key distinction is easy to miss because pretraining feels impersonal. A giant corpus looks like “data,” and data sounds like nature. It is not nature. It is people.

An LLM does not mostly learn by acting in the world and seeing the consequences. It learns by ingesting artifacts humans produced for their own reasons: articles, forum posts, source code, legal opinions, Wikipedia edits, fan fiction, bug reports, marketing copy, scientific papers, sermons, flame wars, and a truly heroic quantity of SEO sludge. What looks like raw fuel is actually compressed human labor. The model is trained on the residue of millions of minds solving problems, narrating events, arguing, documenting, joking, copying, and pretending to know what they are talking about.

That matters because The Bitter Lesson was written against a recurring temptation in AI: borrow as much intelligence as possible from people, then congratulate yourself for building a machine. Expert systems did it explicitly with rules. LLMs do it more indirectly, by absorbing gigantic libraries of human-generated traces. The mechanism is different. The dependence is still there.

Sutton’s formulation is blunt because he wants to protect the distinction. If the system’s competence comes partly from a reservoir of human knowledge externalized into text, code, and labels, then it is not just scaling on computation in the abstract. It is scaling on civilization’s prior work.

That does not make LLMs fake. It makes them historically legible. They are a very powerful interface to accumulated human culture. That is an incredible achievement. It is also a narrower claim than many people want to make.

The strongest evidence is in the failure modes. Ask a language model about a domain rich in public text, and it often feels uncannily competent. Ask it to operate where textual precedent is thin, feedback is delayed, and reality has sharp edges, and the performance gets slippery. A model can explain fluid dynamics beautifully while still being useless at designing an actually reliable pump. It can produce plausible legal reasoning and still invent a case. It can write code that passes the vibe check and fails the test suite. The gap is not mysterious. The training signal rewards predicting what humans would write next, not discovering what the world will permit.

People sometimes respond that text itself contains humanity’s experience, which is true. A chemistry paper is experience written down. Source code repositories are experience written down. Medical case notes are experience written down. But Sutton’s point is not that text contains no reality. It is that mediated experience has a ceiling. If your primary route to competence is whatever humans happened to record, then your growth path is chained to the exhaust of human activity. At some point, the available corpus becomes less like an open frontier and more like a mined field.

That is the part of the lesson the market keeps trying to blur.

The ceiling is not abstract anymore

For a while, “running out of data” sounded like a distant academic concern, the AI equivalent of worrying about hitting the speed of light on your morning commute. There was always more internet. There were more books, more code, more subtitles, more PDFs on government websites that even raccoons would decline to eat.

Now the issue feels less theoretical.

The best public text has already been heavily harvested. Labs are fighting over licensing deals because fresh, high-quality corpora have become strategically valuable. Synthetic data is rising partly because the easy natural supply is no longer easy. Training sets increasingly sweep in lower-quality text, duplicated content, machine-generated material, and multilingual fragments that are useful up to a point and noisy after that. The era of casually pretending the web is an infinite lake of pristine learning signal is ending.

This is where Sutton’s warning starts to bite. He asks whether these systems will “reach the limits of the data and be superseded by things that can get more data just from experience rather than from people.” If that happens, he says, it would be “another instance of the bitter lesson.”

That phrasing is devastating because it relocates LLMs inside the pattern they were supposed to have resolved. Instead of representing the victorious general method, they may turn out to be another strong-but-limited approach that benefited from a huge stockpile of human scaffolding and then plateaued.

The important word there is “superseded,” not “disproved.” Sutton is not saying language models are useless. The world plainly disagrees, and so would he. They are productive tools, useful interfaces, and remarkable compressors of collective knowledge. Supersession in AI usually works differently. It means the previous method remains valuable but loses its claim to being the main engine of progress. Chess engines did not stop using search when neural nets arrived; the winning systems fused methods in a way that changed what mattered. Symbolic components did not vanish from software once machine learning took off; they just stopped being the source of frontier gains.

That same pattern could hit LLMs. They may become a substrate, a prior, a user interface, a planner, a world-model-ish helper. The frontier may move elsewhere, toward systems that learn by doing rather than by reading what others did.

You can already see researchers feeling for the edges of this transition. Robotics needs interaction because language alone will not teach balance, friction, or recovery after failure. Autonomous coding agents need execution, tests, and deployment feedback, not just more GitHub prose. Scientific discovery needs experimental loops. Personal assistants need persistent memory and consequences. A chatbot without stakes is a very different machine from an agent that loses points, money, or access when it makes the wrong move.

The market hears “agent” and often imagines a bigger prompt with a browser tab. Sutton means something stricter.

Experience is not another word for more text

When Sutton says the scalable method is learning from experience, he is talking about systems that try things, observe outcomes, and update from the mismatch between expectation and reality. “No one has to tell you,” as he puts it. That line can sound almost primitive beside the ornate machinery of frontier models, but it names a hard requirement: the world has to answer back.

Language models mostly learn in a one-way relationship. During pretraining, the corpus does not resist. The target token is fixed. The model guesses, gets corrected, and repeats. This produces broad competence, but it is still competence with respect to a record. The learning process does not involve pursuing goals in an environment that changes in response to action.
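
To make that one-way relationship concrete, here is a minimal sketch of the pretraining objective in PyTorch, with a toy model and random tokens standing in for a corpus. It is an illustration of the loop's shape, not any lab's actual training code: the targets are fixed before training begins, and nothing in the loop depends on what the model's outputs would cause.

    import torch
    import torch.nn as nn

    # Toy setup: a vocabulary of 100 tokens and a tiny stand-in "model".
    vocab_size, seq_len, batch = 100, 16, 4
    model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))

    tokens = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for corpus text
    inputs, targets = tokens[:, :-1], tokens[:, 1:]          # predict each next token

    logits = model(inputs)                                   # (batch, seq_len - 1, vocab)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )
    loss.backward()  # the only feedback is the gap between prediction and the recorded text

The corpus never pushes back; it just sits there being predicted.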

Once you try to build those systems, three missing ingredients suddenly matter.

The first is an objective that is more grounded than “predict the next token.” Real agents need something like success and failure, however imperfectly defined. That might be winning a game, completing a purchase, proving a theorem, passing a software test, or navigating a room without crashing into a chair. Human language can describe these goals, but description is not the same as optimization pressure.

The second is some version of ground truth. In language modeling, “ground truth” often means the text humans wrote. In the world, it means whether the code compiled, the package arrived, the experiment replicated, the robot stayed upright, the customer churned, the booking actually cleared. These signals are messier than tokens. They are delayed, sparse, and sometimes ambiguous. They are also what keeps a system honest.

The third is a feedback loop dense enough to sustain improvement. This is where many current agents feel theatrical. They can call tools, but their learning does not yet come from long, cumulative interaction. They behave more like highly capable interns with amnesia than systems steadily building skill from consequences. That can still be useful. It is not the same thing as an architecture whose knowledge base grows primarily from its own experience.
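
Here is a minimal sketch of what it looks like when all three ingredients are present, using a toy two-armed bandit as the environment. The environment and the numbers are invented for illustration, not a claim about any particular system.

    import random

    class BanditEnv:
        # Toy environment: two actions, one pays off more often. The world answers back.
        def step(self, action):
            p = 0.8 if action == 1 else 0.3
            return 1.0 if random.random() < p else 0.0

    env = BanditEnv()
    values = [0.0, 0.0]       # the agent's own estimates, not something copied from a corpus
    alpha, epsilon = 0.1, 0.1

    for t in range(10_000):
        # Ingredient 1: a grounded objective (collect reward), not next-token likelihood.
        if random.random() < epsilon:
            action = random.randrange(2)
        else:
            action = max((0, 1), key=lambda a: values[a])
        # Ingredient 2: ground truth arrives from the environment, not from a labeled record.
        reward = env.step(action)
        # Ingredient 3: a dense feedback loop; every interaction updates the estimates.
        values[action] += alpha * (reward - values[action])

    print(values)  # the estimate for action 1 should drift toward roughly 0.8

The point is not the algorithm, which is decades old. The point is the shape of the loop: goal, consequence, update, repeat.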

Once you frame it this way, a lot of present-day AI engineering looks transitional. Reinforcement learning from human feedback gave chat models better manners, but it is still human preference shaping the objective. Tool use extends reach, but often without persistent learning. Synthetic data helps, but much of it is still generated from the model family itself, which is a clever form of self-distillation and occasionally a polite way of saying the student is photocopying its own notes.

None of this invalidates current progress. It just suggests the field may still be leaning on a temporary abundance: decades of digitized human output, plus enough compute to compress it into a responsive interface.

Lock-in happens because the shortcut works for a while

Sutton is also pointing to a recurring psychological pattern in AI. Researchers get “locked into the human knowledge approach,” he says, and then “get their lunch eaten by the methods that are truly scalable.” The phrase sounds casual. The history behind it is not.

This lock-in is not just intellectual stubbornness. It comes from incentives that reward visible progress on short timelines. Human-enabled methods usually work earlier. They shine in demos. They fit benchmarks. They are easier to debug because the source of competence is closer to us. If you add curated data, task-specific scaffolding, expert-written rules, chain-of-thought exemplars, hand-built evaluators, and carefully chosen tools, performance improves now. Investors like now. Product teams also like now, because payroll clears monthly.

Experience-based learning is harder to love at first. It is slower, messier, and more expensive to set up. You need environments, reward signals, simulators, safety boundaries, exploration strategies, memory, and some way to stop the system from discovering the digital equivalent of eating glue. The early curves often look bad compared with the polished gains from simply scaling a pretrained model on more human artifacts.

Then, if history is any guide, the curve bends.

This is the part of Sutton’s worldview that people either find deeply clarifying or mildly offensive. He keeps insisting that methods which can continue improving with more computation and self-generated experience will eventually outrun methods that rely on human supply chains. Not because human knowledge is bad, but because it bottlenecks the system at the rate and form of human production. The moment the machine can collect more relevant experience than people can conveniently write down, the center of gravity shifts.

The current boom has all the ingredients for lock-in. The products are already shipping. The revenue is real. The benchmarks still move when models get bigger. An entire ecosystem now depends on the assumption that text-centric pretraining remains the master resource and that agency can be layered on later, perhaps almost as a product feature. That assumption could hold longer than critics expect. Markets can extend local truths impressively far.

But lock-in is dangerous precisely when the present keeps paying.

The likely future is a hybrid, with the center of learning moving outward

It would be too clean to say that text-based models are old news and experiential agents are the future, full stop. The more plausible outcome is a layered system.

Large pretrained models are excellent priors. They provide language, abstractions, world knowledge, code syntax, planning heuristics, and a broad sense of how humans describe tasks. Throwing that away would be like insisting a child should rediscover algebra from scratch because textbooks are technically human knowledge. Nobody serious wants that.

The more interesting shift is where the frontier value comes from. If Sutton is right, the decisive improvements will increasingly come from loops that attach these models to environments where they can act, fail, recover, and accumulate knowledge that was not simply sitting in a corpus waiting to be compressed. Pretraining may become the starting literacy. It will not be the whole education.

You can imagine this transition in software first because the environment is unusually cooperative. Code has execution. Tests produce concrete signals. Sandboxes are cheap. Rewards can be approximate and still useful. A system that writes code, runs it, inspects the error, revises, and remembers what worked is already closer to Sutton’s ideal than a system that only predicts plausible snippets from repository statistics. Similar logic applies in scientific simulation, logistics, finance, and robotics, though the safety costs rise fast once atoms are involved.
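
A rough sketch of that loop's shape, with a deliberately dumb stand-in where a model would go. The propose_fix function here is a placeholder heuristic invented for illustration, not part of any real agent framework.

    import os, subprocess, sys, tempfile

    def run_candidate(source):
        # Execute a candidate program in a subprocess; report success plus any error text.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, text=True, timeout=10)
            return result.returncode == 0, result.stderr
        finally:
            os.unlink(path)

    def propose_fix(source, error):
        # Stand-in for a model call: a trivial heuristic that defines missing names.
        if "NameError" in error:
            missing = error.split("'")[1]
            return f"{missing} = 0\n" + source
        return source

    memory = []                             # what worked, carried across attempts
    candidate = "print(undefined_name)"
    for attempt in range(5):
        ok, error = run_candidate(candidate)
        if ok:
            memory.append(candidate)        # remember the version reality accepted
            break
        candidate = propose_fix(candidate, error)  # revise using the observed failure

Swap the heuristic for a model and the toy script for a real task, and you have the skeleton of the systems people are now racing to build.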

There is also a political economy buried here. Pretraining on human text concentrates power around whoever controls compute, data licenses, and model distribution. Experience-driven systems may redistribute some advantage toward whoever controls environments, interfaces, deployment contexts, and feedback loops. The next competitive moat may not be one more trillion tokens. It may be continuous access to high-quality interaction at scale.

That would change what “data” means in boardrooms. A document archive is static capital. An interactive environment is a living one. It keeps producing learning signal every time the system acts.

Sutton’s argument is sharp because it does not flatter the current winners. It says the field may once again be mistaking an effective bridge for the destination. LLMs are powerful because they let machines inherit a huge amount of humanity’s written cognition in one gulp. That inheritance has carried the industry astonishingly far. It may also be the reason this phase has a ceiling.

If that ceiling arrives, the irony will be hard to miss. The essay used to defend the age of giant language models will have predicted their demotion all along.

End of entry.

Published April 2026