The Cognitive Core and the Case for Forgetful Models
Children are terrible archivists. They forget names, blur timelines, and lose half of yesterday before lunch. Yet they learn language, social rules, physics, and causality with an efficiency that still makes our best models look strangely brittle.
That contrast is not a cute metaphor. It points at a real technical problem. Large language models may remember too much of the wrong kind of thing, and that memory may be blocking the kind of learning we actually want.
A human child cannot hear a random sequence once and repeat it perfectly. A modern model often can, or can get disturbingly close after very little training. Andrej Karpathy has used this contrast to make a sharp point: our bad memory is useful because it pushes us toward the general part of an experience. We compress. We keep the pattern and drop the noise. A model with enough capacity can keep both.
That sounds like a superpower until you ask what sort of intelligence it produces.
Learning through loss
Infantile amnesia looks like a design flaw. Most adults remember almost nothing from early childhood, and even later memories are partial, reconstructed, and often wrong in the details. Human memory is less like a hard drive and more like a sketchbook that someone keeps erasing and redrawing.
Yet that same system is astonishing at abstraction. Children do not memorize every sentence they hear. They infer grammar from fragments. They do not store a clean table of physical laws. They learn that unsupported objects fall, hidden objects still exist, and faces carry intention. They are constantly forgetting particulars and keeping structure.
There is a computational advantage hidden in that messiness. If you cannot preserve every detail, you are forced to search for reusable rules. Memory becomes a bottleneck, and the bottleneck acts as a teacher: it asks, in effect, which parts of an experience are worth keeping because they will help again.
Training a language model does not impose the same pressure. The objective is simple: predict the next token over a giant corpus. If the corpus contains recurring facts, idiosyncratic phrases, formatting debris, copyrighted paragraphs, stock tickers, broken HTML, and every flavor of internet sludge, then one path to lower loss is to memorize surprising chunks of it. Not always verbatim, not uniformly, but enough that the model's weights become a lossy compression of the dataset itself.
That phrase matters: a lossy compression of the dataset. We often talk as if the weights are pure reasoning machinery. They are not. A large share of them is serving as a weird, blurry storage layer for everything the model was rewarded for not forgetting.
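The objective itself fits in a few lines. Here is a minimal sketch of the average next-token cross-entropy loss over a toy corpus, with a smoothed, context-free unigram count model standing in for the network (the corpus and model are invented for illustration): anything that lowers the average negative log-probability counts as progress, whether it comes from a reusable rule or from memorizing a surprising chunk.

```python
import math
from collections import Counter

# Sketch of the pretraining objective: average negative log-probability
# assigned to each token. The "model" here is just add-one-smoothed
# unigram counts, a toy stand-in for a network; real training predicts
# each token conditioned on its context.

corpus = "the cat sat on the mat the cat sat".split()
counts = Counter(corpus)
vocab = len(counts)
total = sum(counts.values())

def prob(token):
    return (counts[token] + 1) / (total + vocab)  # add-one smoothing

loss = -sum(math.log(prob(t)) for t in corpus) / len(corpus)
print(round(loss, 3))  # lower is better; memorization lowers it too
```

Nothing in this quantity distinguishes a robust abstraction from a brittle association; both are rewarded whenever they shave off loss.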
Memorization is useful until it hijacks the system
This is where the usual intuition gets slippery. Memorization is not inherently bad. You want a model to know syntax, common facts, programming idioms, and the statistical regularities of language. Without some durable internal trace, there is no competence at all.
The problem is proportionality. A system that can memorize arbitrary noise too easily will spend capacity on noise whenever the training objective permits it. Give it a random string, train briefly, and it can learn to reproduce the string with a speed no person can match. That is impressive in the way a warehouse full of filing cabinets is impressive. It says something about storage and retrieval. It says less than we pretend about understanding.
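The asymmetry is easy to caricature in code. In this toy sketch, a lookup table stands in for high-capacity weights: one pass over an arbitrary string is enough for perfect verbatim recall. The alphabet, context length, and table are all invented illustration, but the point survives; this is storage, not understanding.

```python
import random

# Toy caricature of cheap memorization: a lookup table (standing in
# for high-capacity weights) sees a random string once and can then
# reproduce it verbatim. No pattern is learned, only stored.

CONTEXT = 16  # long enough that contexts in the noise are unique

def train_memorizer(text):
    table = {}
    for i in range(len(text) - CONTEXT):
        table[text[i:i + CONTEXT]] = text[i + CONTEXT]  # context -> next char
    return table

def generate(table, seed, length):
    out = seed
    while len(out) < length:
        nxt = table.get(out[-CONTEXT:])
        if nxt is None:  # unseen context: the table has nothing to say
            break
        out += nxt
    return out

random.seed(0)
noise = "".join(random.choice("abcdef") for _ in range(200))
table = train_memorizer(noise)
recalled = generate(table, noise[:CONTEXT], len(noise))
print(recalled == noise)  # one exposure, verbatim recall of pure noise
```

The same machinery is helpless the moment it meets a context it has not literally seen, which is the failure mode the next paragraphs describe.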
Humans are protected from this trap by their own limitations. A person exposed to one bizarre sequence is unlikely to preserve it exactly. The system throws most of it away. That sounds inefficient, but it creates a bias toward what transfers. A child hears many verbs used in many forms and extracts an underlying pattern. A model can also learn that pattern, but it has less incentive to discard the accidental details attached to each example.
This may help explain a familiar frustration with language models. They can look fluent across a wide range of tasks, then fail in ways that feel embarrassingly literal. Move slightly off the distribution of their training data, or ask them to reason under a new constraint, and the confidence remains while the generalization wobbles. The model has absorbed oceans of text, but absorption is not the same as building compact internal procedures that survive novelty.
People sometimes describe this as a reasoning gap. Part of it may be a memory allocation problem.
The case for a cognitive core
Karpathy has floated a prediction that sounds almost absurd in an era of giant models: in twenty years, a highly useful conversational system might run on something like one billion parameters. Not because capability has stalled, but because we may eventually separate the part that thinks from the part that stores.
That smaller core would not carry an encyclopedia inside itself. It would carry procedures. How to decompose a task. How to notice uncertainty. How to run a simple experiment. How to call a tool, verify a result, and revise a plan. In other words, the durable machinery of problem-solving, without the burden of pretending to personally remember the whole internet.
This idea becomes clearer if you imagine the model less as a scholar and more as a lab assistant. You do not need the assistant to have every paper memorized. You need the assistant to know how to look things up, compare sources, test a hypothesis, and keep track of the current state of work. A good assistant who knows when to check is often more useful than a confident one with a head full of stale facts.
There is a humility built into that architecture. A model stripped of encyclopedic memory would be forced to represent ignorance more explicitly. It would have to say, in effect, I do not know that, but I know how to find out. That is a more mature stance than the current default, where the model often produces a plausible answer because plausibility is what its training objective has paid for.
A one-billion-parameter core is not a serious proposal if it must also encode half of Wikipedia, legal doctrine, medical edge cases, software documentation, and every obscure fandom wiki. It becomes plausible only if those facts live elsewhere and are fetched when needed. Then the compact model is free to specialize in coordination rather than storage.
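One way to make that split concrete is a sketch in which a tiny "core" owns only procedure (decide whether to retrieve, admit ignorance) while an external store owns the facts. The store's contents, the query format, and the fallback phrasing here are all invented for illustration; a real system would sit a retriever and tools in place of the dictionary.

```python
# Sketch of the separation: the core holds no facts, only the
# procedure of checking an external store and representing ignorance
# explicitly. The store stands in for a search index or database.

KNOWLEDGE_STORE = {
    "boiling point of water at sea level": "100 degrees Celsius",
    "author of 'On the Origin of Species'": "Charles Darwin",
}

def core_answer(question, store):
    """Coordinate, don't memorize: retrieve if possible, else say so."""
    fact = store.get(question)
    if fact is None:
        return "I do not know that, but I know how to find out."
    return fact

print(core_answer("boiling point of water at sea level", KNOWLEDGE_STORE))
print(core_answer("population of Atlantis", KNOWLEDGE_STORE))
```

The design choice worth noticing is that ignorance is a first-class return value rather than something papered over by a plausible guess.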
Most training data is compression bait
This is the unfashionable part. A lot of the internet is junk from a learning perspective.
Not all junk is useless. Even messy text teaches formatting conventions, genre, tone, and background world knowledge. But pretraining corpora also contain endless low-value tokens: boilerplate, duplicate pages, machine-generated spam, malformed tables, tracking strings, fragments of code that teach no principle, and documents that are informative only because they happened to exist in bulk. When a model is trained on that mixture, it is rewarded for building an internal compression scheme for the whole landfill.
That changes how we should think about scale. Bigger models are often framed as more intelligent, but some of their additional parameters may simply be renting storage space to the dataset. The model needs room for millions of brittle associations because the corpus is full of brittle associations.
Clean the data aggressively and the required size may drop by more than many people expect. If the training set contains fewer arbitrary identifiers and fewer redundant low-signal documents, then memorizing them becomes less useful. The optimization pressure shifts. Capacity can be spent on robust abstractions rather than archive duty.
This does not solve everything. High-quality data is expensive, culturally narrow if curated carelessly, and harder to define than people admit. A corpus made only of polished textbooks would produce its own distortions. Human intelligence is shaped by messy environments too. Still, there is a large difference between productive mess and statistical sewage, and today we often train on both as if quantity alone will sort it out.
Clear context, hazy weights
One of the most helpful ways to think about model memory is to split it in two.
There is the context window, implemented through the KV cache during inference. That is working memory. Put a document in front of the model and it can use it with astonishing precision. Ask it to quote, compare passages, obey a formatting rule, or update an answer based on a fresh paragraph, and it often performs far better than when relying on what is buried in its parameters.
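The mechanism behind that working memory can be sketched in a few lines. During generation, each token's key and value vectors are computed once and appended to a cache; every later token attends over the cache instead of reprocessing the prefix. The dimensions and random "projections" below are toy stand-ins, not a real transformer.

```python
import math
import random

# Minimal sketch of a KV cache: keys and values for past tokens are
# stored once, and each new query attends over the stored entries.

D = 4  # head dimension (toy size)
random.seed(0)

def rand_vec():
    return [random.gauss(0, 1) for _ in range(D)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, k_cache, v_cache):
    scores = [dot(k, q) / math.sqrt(D) for k in k_cache]  # logits vs cache
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]                                # softmax weights
    return [sum(wi * v[j] for wi, v in zip(w, v_cache)) for j in range(D)]

k_cache, v_cache = [], []       # the cache grows by one entry per token
for step in range(5):           # pretend to generate five tokens
    k_cache.append(rand_vec())  # appended once, never recomputed
    v_cache.append(rand_vec())
    out = attend(rand_vec(), k_cache, v_cache)

print(len(k_cache))  # one cached entry per generated token
```

Everything in the cache is addressable at full precision, which is exactly why in-context material feels sharp in a way parameter memory does not.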
Then there are the weights themselves. That is long-term memory in a much stranger sense: distributed, compressed, and difficult to inspect. Facts stored there do not sit as crisp entries. They survive as statistical traces. Sometimes the trace is strong enough to feel exact. Often it is fuzzy, blended with neighbors, or distorted by later training.
That difference should reshape product design. If what is in context is sharp and what is in weights is hazy, then we should stop demanding that one monolithic model be both thinker and library. Give it a reliable working set. Let it retrieve documents, use tools, maintain a scratchpad, and update its plan in public. Keep the internal core focused on operations that benefit from internalization: syntax, planning habits, causal heuristics, code transformations, and the tacit glue that lets an agent move from one step to the next without falling apart.
Humans do something similar, though with more drama and worse search interfaces. What you just read is available in detail for a while. What you read a year ago survives as an impression, unless you revisit it. Deep expertise often comes from repeated retrieval and use, not from one perfect act of storage.
Intelligence may need engineered forgetfulness
There is a temptation to treat forgetting as loss and memory as capability. That framing is too simple. Some forgetting is how a system protects itself from drowning in particulars. It is how patterns become visible.
The path forward may involve building models that are deliberately worse at retaining arbitrary detail inside their weights, while becoming better at maintaining local context and reaching external knowledge. That would feel like a step back if you measure greatness by how much text a model can recite from its training set. It would look like progress if you care about systems that adapt, check themselves, and generalize cleanly.
The strange possibility is that machine intelligence improves when we stop asking it to be a museum. The future may belong to models with smaller cores, cleaner priors, better tools, and less internal clutter from the internet's attic. Children do not learn because they remember everything. They learn because they cannot, and the world keeps forcing them to infer what matters.
End of entry.
Published April 2026