Vibe Coding in Production: When AI Writes 22,000 Lines and Nobody Reads Them

Anthropic merged 22,000 lines of AI-written code into production. No human sat down and read that code end to end with the care we still associate with serious engineering. If that sentence does not make you uneasy, you are not paying attention.

My first response was simple: this sounds reckless. The second response arrived a few seconds later: this is probably where software is headed.

What Erik Schluntz described in Anthropic’s presentation was not a stunt. It was a glimpse of a development model that stops treating code as the primary object of attention. That is the real shock. We built an industry around the idea that good software comes from humans reading, writing, and reviewing source files. Now the volume is starting to outrun the reading capacity of the people supposedly in charge.

This is not just a tooling upgrade. It is a change in what “doing software” means.

The curve is the story

The most important fact in this conversation is easy to miss because it sounds abstract. METR’s research suggests that the length of tasks frontier models can complete is doubling on a timescale of roughly seven months. The exact benchmark details matter, and these trends are not laws of physics, but the shape of the curve matters more than any single number.

A model that can reliably handle an hour-long engineering task is not just a better autocomplete. It is a system that can keep state, recover intent, and execute a sequence of choices over time. Stretch that horizon forward by a year or two and you stop talking about snippets. You start talking about features, migrations, debugging sessions, integration work, and ugly maintenance tasks that used to consume whole days.

Most teams are still thinking in linear terms. How many tickets can we finish this sprint? How many lines changed? How many reviewers approved the pull request? Those habits made sense when the expensive thing was generating code. They make much less sense when generation becomes cheap and the expensive thing becomes deciding what should exist, proving that it behaves correctly, and containing the blast radius when it does not.

That is why the 22,000-line example lands so hard. It is not impressive because the number is large. Large diffs happen all the time. It matters because it makes an old assumption feel suddenly fragile: that responsible engineering requires direct human inspection of most of the code that ships.

For decades, our faith rested on a rough chain of custody. A developer wrote code. Another developer read it. A test suite caught some issues. Production revealed the rest. AI agents break the first link by flooding the system with output faster than a team can absorb. Once that happens, the rest of the workflow has to be redesigned.

“Vibe coding” is a terrible phrase for a real shift

The label does not help. “Vibe coding” sounds like a joke, or worse, a brand-safe way to say “I stopped caring.” That is why many engineers hear the term and reach for the nearest fire extinguisher.

Karpathy’s original formulation was provocative on purpose. Let the model write the code. Stay focused on the shape of the thing you want. Do not obsess over every implementation detail. Taken literally, that sounds irresponsible. Taken seriously, it captures something important: the center of gravity is moving away from typing and toward specification.

There is a weak version of this idea, and a strong version.

The weak version is familiar already. You use Copilot, Cursor, or Claude to generate boilerplate, speed up refactors, or draft tests. You still read almost everything. The machine is a productivity layer wrapped around a conventional process.

The strong version is different. You accept that some meaningful share of production code will never be read line by line by a human. You stop pretending otherwise. Then you rebuild your process around that reality.

Anthropic’s story matters because it was clearly the strong version. The code was not a toy app. It supported reinforcement learning infrastructure, which means it lived in a part of the stack tied closely to actual business and research outcomes. That does not prove every company should copy the move tomorrow. It does prove that “serious teams will never do this” is already outdated.

Reading code stops being the bottleneck and starts being the wrong unit

There is an old analogy here, and it is useful up to a point. Most developers do not inspect the assembly generated by their compiler. They trust layers of abstraction, then verify the behavior that matters. High-level languages won because reading machine output is not how humans should spend scarce attention.

AI-generated code invites a similar shift, but it is not identical. A compiler translates a precise program according to deterministic rules. A model generates plausible implementations under uncertainty. That difference is not cosmetic. It means you cannot simply transfer trust from one category to the other and call it a day.

Still, the direction is similar. Once an agent can generate more code in an afternoon than your team can responsibly review, “read all the code” becomes less a standard than a comforting ritual. A ritual can still catch some bugs, but it no longer scales with output. Review has to move up a level.

That means reviewing interfaces, invariants, failure modes, and test coverage rather than staring at every branch condition. It means evaluating whether the system honors contracts under pressure, not whether each helper function has pleasing prose. It means you care more about whether the generated migration preserves data than whether the loop body feels elegant.

For many engineers, this sounds like surrender. In one sense, it is. You are giving up the intimate, line-by-line relationship with source code that shaped the craft for decades. But clinging to that relationship while agents accelerate around you will not preserve quality. It will just create process theater.

The safe version starts with architecture

Schluntz’s most practical insight was also the least flashy. Put AI-generated code in leaf nodes first. In other words, target parts of the system that other components do not deeply depend on. If the model produces something awkward, or subtly wrong, the damage stays local.

This is a boring discipline, which is usually how you know it matters.

Think of a codebase like a city’s water system. The trunk lines matter more than the faucets. If you want to experiment with fast, cheap construction, you do it at the edges before you rebuild the central pressure network. The same logic applies here. Generated code is easiest to tolerate where the coupling is low, the contracts are clear, and rollback is straightforward.

Humans should still own the structural beams for now: core data models, authorization boundaries, payment flows, security-sensitive paths, anything with irreversible real-world effects. Those areas need more than correctness in the narrow sense. They need legibility, predictability, and institutional memory. An agent can help there, but handing it the pen completely is a different kind of bet.

This architectural split also changes how teams should think about modularity. For years, modular design has been sold as a maintainability virtue. Now it becomes a governance mechanism. A well-separated codebase gives you zones where high-autonomy generation is acceptable and zones where it is not. The same dependency graph that once mattered for scaling teams now matters for scaling machine authorship.

The real work moves into specification

If code is cheaper, ambiguity becomes more expensive.

That is the part many teams are about to learn the hard way. When an engineer writes code manually, they often resolve small ambiguities on the fly. Their understanding evolves as they build. When an agent writes the code, that slippage gets externalized. Vague intent becomes concrete output, usually wrapped in impressive confidence.

So the scarce skill shifts upstream. The valuable engineer is not just the person who can implement a feature. It is the person who can define the feature so clearly that implementation has little room to drift. Good prompts are not the point. Good specifications are.

Anthropic reportedly spent days on the requirements for that large generated change. That ratio is revealing. The human effort did not disappear. It moved. Instead of pouring hours into syntax and local reasoning, people invested time in constraints, expected behavior, evaluation criteria, and system boundaries.

That sounds suspiciously like product management, and in part it is. But it is not “developers become PMs” in the simplistic sense. It is closer to this: engineering judgment migrates upward from execution toward definition. The best technical people will still need deep implementation knowledge, because the only way to write a good spec is to understand how systems fail.

This is where the romantic image of vibe coding falls apart. You do not get reliable production outcomes by “going with the vibes.” You get them by being far more precise about inputs and much more disciplined about verification than many teams have ever needed to be. The irony is almost funny.

Testing becomes the primary reading method

If you are not reading every line, how do you know what you shipped?

You ask the software to reveal itself through behavior.

That means stronger test suites, but it also means different tests. Traditional unit tests still matter, though agents can produce those by the bucket and sometimes with the same enthusiasm a junior developer has for adding mocks to everything. The more important layer is behavioral evaluation: integration tests, property-based tests, stress tests, replay against production-like traffic, canary deployments, anomaly detection, rollback hooks.

You are not trying to prove that the implementation is elegant. You are trying to establish that it respects the contract in enough environments to deserve trust. This sounds obvious, yet much of code review culture still rewards readability signals over operational assurance. AI-generated code makes that imbalance harder to defend.
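One way to cash out "respects the contract" is a property-style check: rather than reading the implementation, you hammer it with randomized inputs and assert the behavioral guarantees. The sketch below uses a stand-in `dedupe_preserving_order` function as the imagined generated code under review; the function, its name, and the trial count are all hypothetical.

```python
import random

def dedupe_preserving_order(items):
    """Stand-in for a generated implementation under review."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

def check_contract(trials: int = 500) -> None:
    """Property-style check: assert the behavioral contract over many
    random inputs instead of inspecting the implementation."""
    rng = random.Random(0)  # fixed seed for reproducibility
    for _ in range(trials):
        data = [rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
        result = dedupe_preserving_order(data)
        assert len(result) == len(set(result))           # no duplicates
        assert set(result) == set(data)                  # nothing lost or invented
        assert result == sorted(result, key=data.index)  # original order kept
```

Dedicated property-based testing libraries do the input generation and shrinking far better than this hand-rolled loop, but the principle is the same: the contract, not the source, is what gets read.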

Observability also stops being a nice-to-have. If you are shipping code that no one has deeply inspected, you need excellent telemetry once it is live. Logs, traces, metrics, drift alerts, resource usage patterns, user-visible error rates, weird edge-case spikes at 3 a.m. These become part of the review process, just delayed into runtime. Production turns into an active validation layer rather than the place where you passively hope nothing weird happens.
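A minimal sketch of what "production as a validation layer" can mean mechanically: a sliding-window monitor that flags a rollback once the recent error rate crosses a threshold. The class name, window size, and threshold here are illustrative assumptions, not recommendations for real traffic.

```python
from collections import deque

class CanaryMonitor:
    """Track recent request outcomes and flag a rollback when the error
    rate over a sliding window exceeds a threshold. A toy stand-in for
    real canary analysis; the defaults are arbitrary."""

    def __init__(self, window: int = 1000, max_error_rate: float = 0.02):
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.max_error_rate = max_error_rate

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)

    def should_roll_back(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough signal yet to judge the deploy
        errors = self.outcomes.count(False)
        return errors / len(self.outcomes) > self.max_error_rate
```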

None of this eliminates the need for human judgment. It changes where judgment is applied. The human reviewer becomes more like an auditor of behavior and less like a copy editor of source files.

Responsibility gets blurrier, not smaller

One of the more uncomfortable consequences of this shift is that ownership becomes harder to narrate. Git blame was always a crude tool, but at least it mapped roughly onto authorship. If most of a module was drafted, revised, and expanded by an agent, what exactly does authorship mean?

The answer cannot be “the model did it,” because models do not hold pager duty, answer customers, or testify in a postmortem. Responsibility remains human and organizational, even if authorship becomes distributed across prompts, generated drafts, evaluation harnesses, and approving reviewers.

That sounds manageable until something fails in a subtle way. A security bug caused by a model’s shaky library usage is not meaningfully solved by identifying which engineer clicked “accept.” The accountable party is the team that created the conditions under which opaque generated code could ship without sufficient safeguards. This pushes organizations toward process accountability rather than individual heroics.

It also changes what competence looks like. The strongest engineers in this world may not be the ones who can hand-write the cleverest parser under pressure. They may be the ones who can design a system where generated code is constrained, testable, reversible, and observable. That is a different prestige economy, and it will irritate people who built their identity on local implementation brilliance.

Junior engineers face the strangest transition

There is a real education problem here, and it should not be waved away with cheerful talk about “AI-native talent.”

People learn software partly by wrestling with details. You debug the ugly stack trace. You write the awkward first version. You read old code and slowly develop taste about what holds up. If agents absorb too much of that friction, some juniors may ship features without building the mental models that make senior judgment possible.

At the same time, refusing the tools will not save anyone. A junior developer with a strong model can explore a codebase faster, ask better questions, and test more ideas than a junior without one. The issue is not tool access. It is whether teams preserve the learning loop.

That probably means being more intentional about what humans still have to understand directly. Reading generated code selectively still matters. Re-implementing small components by hand still matters. Tracing failures back through a generated system still matters. The apprenticeship does not disappear, but it becomes less automatic. Managers will have to create it on purpose instead of assuming the work itself will provide it.

There is a deeper cultural risk too. If senior engineers get rewarded for shipping outcomes while juniors become prompt shepherds for code they cannot explain, the profession could develop a hollow middle. Plenty of output, thinner intuition. That is not inevitable, but it is easy to imagine.

This will spread because the economics are too strong

It is tempting to treat Anthropic as a special case. Frontier lab, unusual tooling, strong internal expertise, atypical risk tolerance. All true. It is also tempting to say that ordinary companies cannot adopt these practices safely. Sometimes true, sometimes self-protective.

The bigger point is economic. If one team can define work clearly, delegate the bulk implementation to agents, and verify results with strong testing and observability, that team will move much faster. Speed is not everything, but when the gap gets wide enough it starts to reshape norms. People stop arguing about whether the change is tasteful and start asking how to capture the gains without blowing up production.

This adoption will not be uniform. Highly regulated industries will move more slowly. Safety-critical systems will keep a larger human-authored core. Legacy codebases with tangled dependencies will struggle because they lack the modular boundaries that make high-autonomy generation tolerable. But the direction is hard to miss. The teams that figure out how to turn architecture, specifications, and evals into leverage will outpace teams still organizing around artisanal diff review.

And there is a second-order effect. Once software gets cheaper to produce, organizations ask for more of it. More internal tools, more experiments, more integrations, more customized workflows. The demand expands to absorb the supply. That means the pressure to use agents will not just come from engineering leaders chasing efficiency. It will come from the rest of the company discovering that many requests once dismissed as too expensive are now suddenly feasible.

The craft survives, but it changes shape

The sentimental version of software culture says the craft lives in the code. There is truth in that. Good code reflects taste, restraint, and accumulated judgment. Many of us learned to love the work through that intimacy.

But the industry may be entering a phase where the most important craft lives one layer up. In the architecture that keeps generated work contained. In the specification that turns ambiguity into structure. In the eval suite that catches what no reviewer had time to read. In the judgment about which parts of a system deserve legibility above speed, and which parts can be safely delegated to machines.

That does not make code irrelevant. It makes direct authorship less central than it used to be. Some engineers will hate that shift because it feels like losing contact with the material itself. They are not wrong to feel the loss. Yet production systems have always forced us to trade intimacy for leverage. We stopped toggling bits by hand, then stopped writing assembly, then stopped managing infrastructure box by box. Each layer abstracted something away and demanded a new kind of discipline in return.

AI-generated production code is another version of that bargain. The teams that thrive will not be the ones who trust the machine blindly, and not the ones who insist on preserving every old ritual. They will be the ones who learn how to specify clearly, isolate risk, verify behavior aggressively, and reserve human attention for the decisions that actually deserve it.

Published April 2026