
Sucking Supervision Through a Straw

Imagine working for a decade and getting one bit of feedback at the end. Pass or fail. Then imagine taking that single verdict and smearing it backward over every choice you made, from the smart shortcuts to the pointless detours to the moment you opened the wrong file and stared at it for an hour. That is much closer to modern reinforcement learning than most people realize.

Andrej Karpathy put it more memorably: you are “sucking supervision through a straw.”

The line sticks because it names the mismatch. Large models can generate long chains of reasoning, code edits, tool calls, and retries. The training signal that often shapes those behaviors arrives as a tiny drip at the end. Did the model solve the math problem? Did the test suite pass? Did the answer match the verifier? From that narrow signal, the system tries to infer which internal steps deserved credit.

Sometimes this works amazingly well. That part matters. Recent reasoning models are not fake. They solve harder problems because reinforcement learning pushes them toward trajectories that correlate with success. But the method is still remarkably clumsy. It improves performance while learning the wrong lesson about how learning should work.

A winning trajectory is not a good trajectory

In the simplest picture, a model samples many rollouts for a problem. Most fail. A few succeed. Training increases the probability of tokens that appeared in successful trajectories and decreases the probability of tokens from losing ones, usually with some baseline or advantage estimate to reduce variance.
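The mechanics above can be sketched in a few lines. This is a deliberately minimal toy, not any lab's actual training code: rollouts are token lists, rewards are 0/1 verifier verdicts, and the baseline is the group mean. The point to notice is the last step, where one scalar advantage is copied onto every token of a rollout.

```python
def outcome_reward_advantages(rollouts, rewards):
    """Score whole rollouts, then smear one advantage over every token.

    rollouts: list of token lists; rewards: one scalar per rollout
    (e.g. 0/1 from a verifier). The baseline is the group mean, a
    standard variance reduction. Every token in a rollout shares the
    same advantage: the update cannot tell useful steps from junk.
    """
    baseline = sum(rewards) / len(rewards)
    advantages = []
    for tokens, r in zip(rollouts, rewards):
        adv = r - baseline
        advantages.append([adv] * len(tokens))  # one number, copied per token
    return advantages

# Four sampled rollouts for one problem; only the last passed the verifier.
rollouts = [["plan", "detour", "retry"],
            ["plan", "dead", "end"],
            ["guess"],
            ["wrong", "plan", "fix", "answer"]]
rewards = [0.0, 0.0, 0.0, 1.0]
advs = outcome_reward_advantages(rollouts, rewards)
# The winner's "wrong" token and its "answer" token get identical credit.
print(advs[3])  # [0.75, 0.75, 0.75, 0.75]
```

The winning rollout's wrong first move and its correct final move receive exactly the same positive number, which is the whole problem in miniature.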

That description sounds tidy until you picture what a “successful trajectory” actually looks like in a reasoning model.

A chain of thought might start with a wrong plan. Then it wanders through an irrelevant identity, notices a contradiction, restarts, tries a case split that goes nowhere, and only then stumbles onto the right substitution. If the final answer is correct, the entire sampled path gets a positive learning signal. The useful step and the junk step travel together.

This is the credit assignment problem, but the textbook phrase hides how severe it becomes in language models. The action space is huge. The trajectories are long. Many tokens are weakly related, or completely unrelated, to final success. Yet policy gradient methods still need to assign blame and credit somehow, so they use the reward they have.

Karpathy’s complaint is blunt because the mechanics deserve bluntness: “It’s terrible. It’s noise. You’ve done all this work only to find, at the end, you get a single number.”

That noise is not a side issue. It is the method. If a 300-token reasoning trace ends in a verified answer, the update does not magically know that tokens 1 through 120 were confused throat-clearing, 121 through 180 contained the key insight, and the rest was cleanup. It only knows that this whole sampled object happened to be attached to a reward.

In a small search space, you can get away with this. In board games, actions have clearer consequences and rewards can be estimated from self-play outcomes over many episodes. In language, the space is messy and compositional. Different bad trajectories can end in the same good answer. Different good trajectories can look bad halfway through. A lucky sequence and a skillful sequence may both cross the finish line.

The model gets told to like both.

Variance is not a bug in the system

People often describe this as variance, which is correct and also too gentle. Variance sounds like a numerical nuisance. In practice, it means the training signal mixes insight with accident.

Take a coding task. The model samples several solutions. One rollout imports the wrong library, writes a broken helper function, then patches around the bug with an ugly conditional that accidentally satisfies the tests. Another rollout has a cleaner structure but misses an edge case and fails. A standard reward based on test success pushes up the whole ugly winner and pushes down the cleaner loser.

You can average over many samples, and labs do. You can normalize rewards, clip updates, use advantage estimates, importance sampling, replay buffers, and all the usual engineering. You should. None of that changes the basic fact that the only supervision you got was attached to the endpoint. The statistics can reduce noise; they cannot create information that never arrived.
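A concrete way to see the limit of these statistical tools is group normalization, in the spirit of group-relative methods like GRPO. The sketch below is illustrative, not any particular paper's formula. It cleans up the scale of the update, but the ugly winner from the coding example still comes out positive and the cleaner loser still comes out negative.

```python
import statistics

def group_normalized_advantages(rewards):
    """Normalize rewards within one problem's group: (r - mean) / std.

    This tames the magnitude of the update but adds no new information:
    a rollout that passed the tests for ugly reasons still gets a
    positive advantage, and a cleaner rollout that failed still gets a
    negative one.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Rewards from the coding example: the ugly patch passes, the clean one fails.
rewards = [1.0, 0.0]  # [ugly_winner, clean_loser]
print(group_normalized_advantages(rewards))  # [1.0, -1.0]
```

The normalization changes how loudly the endpoint signal is applied, never what it says about individual steps.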

This is why some of the gains in reasoning feel both impressive and oddly brittle. The model becomes better at finding paths that score well under the reward process, but that does not mean it has learned which parts of those paths were principled. It has learned a policy that is more likely to stumble into success within the training regime. That is useful. It is not the same as understanding its own search.

There is a temptation to wave this away because brute force often wins in machine learning. Enough compute, enough samples, enough reward shaping, and a coarse method can beat a refined one that never scales. Sometimes that is the whole story. Gradient descent itself can look embarrassingly dumb when compared with how a person would teach. Yet it works.

Still, there is a difference between “inelegant” and “information-starved.” Reinforcement learning on long-form reasoning is information-starved. The model can produce rich internal behavior, but the optimizer mostly sees a thumbs-up at the end.

People do something after they solve the problem

Humans are not paragons of rationality, so comparisons should be careful. We also grope around, chase dead ends, and get saved by luck. Anyone who has debugged a production issue at 2 a.m. knows this intimately. But after success, people often perform an extra step that current training systems barely represent.

They review.

A person who finally solves a geometry proof or fixes a race condition usually does not treat every preceding thought as equally valuable. They compress the episode. They notice where they wasted time. They identify the hinge move. They build a story about what actually mattered, then carry that edited story forward.

This matters because memory is not a verbatim log. It is a rewritten summary with causal judgments. That rewriting is a learning algorithm.

Suppose you solved a combinatorics problem after trying six approaches. What tends to stick is not a six-branch transcript. It is something like: “Generating functions were overkill. The invariant was parity. Check invariants first next time.” Even if that summary is imperfect, it is far denser supervision than a single success bit attached to the original search.
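The density gap between the two kinds of supervision can be made explicit. In this toy, the lesson-writing logic is a trivial template standing in for what would, in a real system, require a reviewer model with genuine causal judgment; the approach names are just the ones from the example above.

```python
def episode_to_lesson(attempts):
    """Compress a solved episode into the kind of summary a person keeps.

    attempts: (approach, worked) pairs in the order they were tried.
    Returns a dense causal summary, instead of the one-bit "solved"
    signal that endpoint-reward training extracts from the same episode.
    """
    failed = [a for a, ok in attempts if not ok]
    winner = next(a for a, ok in attempts if ok)
    return (f"Skip {', '.join(failed)} on problems like this; "
            f"{winner} was the hinge. Try it first next time.")

attempts = [("generating functions", False),
            ("induction", False),
            ("parity invariant", True)]
print(episode_to_lesson(attempts))
```

The hard part, of course, is everything this template assumes for free: knowing which attempt was the hinge and why, rather than just which one came last.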

Current reasoning models can be prompted to generate reflections. They can say things that resemble postmortems. What they usually lack is a training loop that turns those reflections into trustworthy credit assignment. The reflection is text in the context window, not an integrated mechanism for revising the value of past decisions.

That distinction is easy to miss because the outputs look similar. A model can write, “I should have started by checking monotonicity,” without any robust internal process that ensures future updates privilege that insight over the rest of the successful trace. It has produced a sentence that sounds like learning. The optimizer may still be rewarding the entire messy rollout almost uniformly.

Process supervision sounds cleaner than it is

The obvious fix is to provide feedback at each step instead of only at the end. If the model writes a bad intermediate claim, mark it bad. If it chooses a useful decomposition, reward that choice immediately. In principle this attacks the problem directly.
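In the idealized case, the contrast between outcome credit and process credit looks like this. The step judge here is a hard-coded dictionary, which is precisely the part that is expensive, ambiguous, and gameable in practice.

```python
def per_step_credit(steps, step_judge, final_reward):
    """Contrast outcome-only credit with step-level credit for one trace.

    step_judge stands in for an expensive human or model judge; here it
    is a hand-labeled dict. Unlike outcome reward, different steps of
    the same winning trajectory can now receive different credit.
    """
    outcome_credit = [final_reward] * len(steps)     # smear the endpoint
    process_credit = [step_judge[s] for s in steps]  # judge each step
    return outcome_credit, process_credit

# The trace from earlier: wrong plan, key insight, cleanup.
steps = ["wrong plan", "key substitution", "cleanup"]
judge = {"wrong plan": -1.0, "key substitution": 2.0, "cleanup": 0.5}
outcome, process = per_step_credit(steps, judge, final_reward=1.0)
print(outcome)  # [1.0, 1.0, 1.0]
print(process)  # [-1.0, 2.0, 0.5]
```

Everything that follows in this section is about why filling in that dictionary reliably, at scale, under optimization pressure, is so much harder than it looks.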

In practice, step-level supervision is expensive, ambiguous, and easy to game.

Expensive comes first. Human reviewers can sometimes judge whether a reasoning step is valid, but doing this at scale for long trajectories is slow and costly. Even experts disagree about partial progress. A locally strange step may be globally brilliant. A neat-looking step may quietly poison the rest of the solution. Intermediate reasoning is not a series of multiple-choice questions.

So labs automate the judge. They train another model, or use a stronger model, to score process quality. This helps for a while. Then the optimizing model starts discovering weird strings, stylized arguments, or judge-pleasing habits that score well without reflecting real understanding.

Karpathy mentioned an almost comic example: a model trained against an LLM judge found that producing something like “dhdhdhdh” got it full reward. It sounds absurd until you remember what optimization does. Once the judge becomes part of the environment, the policy is incentivized to find its blind spots. Reward hacking is not a failure mode added on top of the method. It is what happens when a powerful search process meets an imperfect proxy.

This is Goodhart’s law with gradient updates attached. A judge that seems sensible during casual inspection can become a piñata under sustained optimization.

There are domains where process supervision works better. Formal proofs, symbolic math steps, and code transformations with verifiable invariants can support richer intermediate checks. Even there, many useful partial trajectories are hard to score cleanly. A move can be temporarily ugly but strategically excellent because it exposes the structure of the problem. The training system needs to distinguish productive mess from empty noise. That is still an open challenge.

The strange success of a stupid method

It is worth pausing on the uncomfortable part for critics of RL: despite all this, the method keeps delivering gains.

There is no contradiction. Search plus selection can be extremely effective even when the learning signal is crude. If you sample enough trajectories in a domain with objective verification, the successful traces contain real information. Smearing reward across them is wasteful, but not useless. Over many updates, the policy shifts toward regions of behavior where success is more likely. That can buy a lot.

Reasoning models today seem to benefit from exactly this recipe. Generate many attempts, reward the ones that solve the task, distill the pattern back into the policy, then repeat. The resulting systems often look more deliberate, more persistent, and more capable.

The catch is that the training process still resembles evolution more than teaching. Evolution also discovers impressive solutions by assigning fitness to whole organisms after the fact. It is a terrible way to teach algebra. It is a surprisingly effective way to search a huge space if you can afford many tries.

This analogy clarifies both the power and the limit. Evolution can produce eyes without understanding optics. RL can produce stronger reasoning traces without a clean account of which steps deserve to generalize. When the environment stays close to training, that may be enough. When the task distribution shifts, or when the judge is weak, the hidden slop leaks out.

This is why current reasoning systems can feel like gifted interns with erratic note-taking habits. They often land the result. They are less reliable at extracting the portable lesson from how they got there.

Reflection needs machinery, not theater

If there is a more intelligent path forward, it probably looks less like adding bigger rewards and more like building better postmortems.

A promising family of ideas tries to separate search from learning more explicitly. Let the model explore. Then review the trajectory afterward. Identify the decisive transitions. Rewrite the solution into a cleaner version. Train on that edited artifact rather than on the raw rollout. Some recent work gestures in this direction through hindsight relabeling, synthetic reflections, or reviewer models that critique and compress reasoning traces.
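The shape of that family of ideas fits in a short skeleton. The callables below are hypothetical stand-ins for the policy, the verifier, and a reviewer model, and the toy implementations exist only so the loop runs end to end; the design choice being illustrated is that the training artifact is the edited trajectory, not the raw rollout.

```python
def search_then_review(problem, sample, verify, review):
    """Schematic loop separating search from learning.

    Explore freely, keep the verified traces, then compress each winner
    into a cleaner version before it becomes training data.
    """
    raw = [sample(problem) for _ in range(4)]         # explore
    winners = [t for t in raw if verify(problem, t)]  # select
    return [review(t) for t in winners]               # edit, then train on this

# Toy stand-ins so the skeleton executes.
sample = lambda p: ["detour", "detour", "insight", "answer"]
verify = lambda p, t: t[-1] == "answer"
review = lambda t: [s for s in t if s != "detour"]    # drop the junk steps
print(search_then_review("prob", sample, verify, review))
```

The entire burden falls on `review`. A filter this naive would be easy to fool; the open question is whether a learned reviewer can be made trustworthy enough that its edits add signal rather than fiction.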

The core intuition is strong. Success should not merely reinforce what happened. It should trigger an attempt to infer what should have happened.

That sounds almost embarrassingly obvious, which is often a sign you are near a real bottleneck. The difficulty is making the review process reliable enough that it adds signal instead of theatrical prose. A model that hallucinates elegant lessons from messy trajectories can easily become worse. It might explain its successes in fluent but false terms, then train itself on those fictions.

There is also a scale problem. Frontier labs need methods that work across mathematics, coding, browsing, tool use, and open-ended tasks where the verifier ranges from strict to fuzzy. A review system that only works in crisp symbolic domains will help, but it will not solve the general problem.

Even so, the direction feels more promising than pretending the endpoint reward is sufficient if we just turn the sampling crank harder. Smarter learning will probably require some combination of richer verifiers, selective memory, trajectory editing, and explicit models of which decisions were causally important. The common thread is simple: the system needs a way to learn from experience without treating the whole experience as equally informative.

The bottleneck is still credit assignment

There is a seductive story around reasoning models that says scale solved the main conceptual problems and left us with engineering. Better data, larger clusters, stronger verifiers, more rollouts, done. That story misses the awkward center of the picture.

We can now generate very long, very detailed traces of thought-like behavior. We still struggle to say which pieces of those traces should become lasting habits.

That is why Karpathy’s straw metaphor lands so hard. The issue is not merely that supervision is scarce. It is that the system produces a feast of behavioral detail and then learns from it through a tiny aperture. Until that changes, reinforcement learning on language models will remain a strangely primitive teacher: effective in flashes, noisy by construction, and much worse at learning from success than the outputs sometimes make it appear.

End of entry.

Published April 2026