The Word from 1984 That Still Explains AI
In a 1984 episode of The Computer Chronicles, Nils Nilsson used a word that refuses to age: brittle.
He was talking about expert systems. Their knowledge looked impressive until reality bent a little. A child could handle the missing context. The machine could not. Four decades later, an arXiv paper shows modern language models stumbling on a task as simple as deciding whether “apple” belongs to the set {pear, plum, apple, raspberry}. The hardware changed. The training stack changed. The weakness is still sitting in the middle of the room.
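The set-membership task is worth making concrete, because as ordinary code it is exactly solvable. The toy below contrasts a symbolic lookup with a crude similarity-based proxy (a character-overlap scorer invented for this illustration, not the paper's method): the fuzzy version shrugs off surface noise, but it also says yes to things that merely resemble members.

```python
# Illustrative contrast (not the paper's methodology): exact set membership
# versus a fuzzy, resemblance-based stand-in for it.

def exact_member(word, items):
    # Symbolic lookup: correct for every input, every time.
    return word in items

def fuzzy_member(word, items, threshold=0.6):
    # A toy pattern-matcher: scores by character overlap, the way a purely
    # statistical system leans on surface similarity.
    def overlap(a, b):
        shared = len(set(a) & set(b))
        return shared / max(len(set(a)), len(set(b)))
    return max(overlap(word, item) for item in items) >= threshold

items = {"pear", "plum", "apple", "raspberry"}

print(exact_member("apple", items))   # True
print(fuzzy_member("apple", items))   # True
print(fuzzy_member("appel", items))   # True: tolerant of surface noise
print(fuzzy_member("grape", items))   # True as well: resemblance, not membership
```

The point of the toy is the last line: robustness to typos and robustness of meaning are different properties, and a system can have the first without the second.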
What shook me in those archive clips was not nostalgia. It was recognition. Nilsson and John McCarthy were already describing the same wall builders keep hitting now, even with models that can write code, draft legal summaries, and fake confidence better than many humans with actual jobs.
The archive does not feel old
I expected the usual museum-piece experience: chunky terminals, ambitious demos, and a comfortable sense that we had moved far beyond the constraints of that era.
Instead, the conversation felt current in a slightly unnerving way.
Nilsson laid out the problem with rare precision. First, you need to decide what knowledge matters. That sounds obvious until you try it. People carry around huge amounts of implicit understanding and rarely notice it. Second, you need a way to represent that knowledge in a machine. Some facts fit neatly into rules or labels. A lot of useful knowledge does not. Third, you need a way to use that knowledge in context. This is where clean diagrams usually stop helping.
That structure still holds. It describes why so many AI systems look solid in a demo and unstable in production. The issue is not just model quality. It is the mismatch between what humans know, how vaguely we often know it, and what machines can reliably operationalize.
McCarthy sharpened the point with an example from medicine. You can define a sterile container as one in which all bacteria are dead. That definition is easy to encode. It is also not how humans work with the concept. We heat the container. We culture samples. We observe that colonies do not appear. We infer sterility from indirect evidence and background assumptions. We use a concept without reducing it to its literal form.
That kind of knowledge is slippery in a productive way. It is not fully formal, yet it remains reliable in practice. McCarthy’s line was devastating because it did not sound like a temporary engineering complaint. It sounded like a category problem. Some kinds of knowledge are used vaguely, and that vagueness is part of their utility.
We still do not know how to build that well.
Brittleness changed shape
The modern version looks different from the old expert-system failure mode, which is why people often miss the continuity.
Expert systems broke because their rules were too rigid. Language models break because their flexibility is not the same thing as grounded understanding. They can absorb astonishing amounts of pattern, then fail on a variation that a human barely registers as different. The result feels less mechanical and more uncanny, which can fool you into trusting it longer than you should.
A recent audit of an automated customer-request triage system made this painfully concrete. On benchmark data, the system looked excellent. Three weeks into production, the error rate had climbed to roughly 15 percent. The main causes were mundane: regional phrasing, misspellings, rewritten requests, and odd combinations of otherwise familiar terms. The model had learned the neighborhood of its examples. It had not learned the territory.
This is where much of the discussion gets lazy. People say current models are “bad with typos” or “bad with messy language” as if that were the central issue. In many cases, it is not. Modern models are often surprisingly tolerant of surface noise. They can recover from misspellings better than older NLP systems ever could.

The deeper problem is thinner and more interesting: when the context is sparse, underspecified, or poorly structured, the model starts filling the gaps with statistical habit. It reaches for the nearest plausible pattern, then presents that guess with the smoothness of someone who has never once met an edge case. Give the same model dense, well-architected context and its behavior often becomes dramatically more stable. Same weights. Same prompt class. Different informational scaffolding.
That is not a small implementation detail. It changes how you think about reliability. The model is not simply a brain in a box. It behaves more like a reasoning layer strapped onto an information environment. If that environment is thin or noisy, the reasoning degrades fast.
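One way to picture that information environment: the same question, assembled into a prompt with and without retrieved context. Everything here is a hypothetical stub (the `FACTS` dictionary, the order id); the point is only that what the model sees is an engineering artifact, not a given.

```python
# Hypothetical sketch of "informational scaffolding". FACTS stands in for a
# real retrieval layer (database, search index, ticket history, etc.).

FACTS = {
    "order-1042": "Order 1042: shipped 2024-05-03; carrier lost the parcel; refund pending.",
}

def build_prompt(question, order_id=None):
    context = FACTS.get(order_id, "")
    if context:
        # Dense framing: the model can ground its answer in retrieved facts.
        return f"Context:\n{context}\n\nQuestion: {question}"
    # Sparse framing: the model must fill the gaps from statistical habit.
    return f"Question: {question}"

print(build_prompt("Where is my order?", "order-1042"))
print(build_prompt("Where is my order?"))
```

Same model, same question; the only variable is the scaffolding, which is exactly where much of the observed stability difference lives.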
The set-membership paper lands because it exposes this gap cleanly. These models can sometimes solve tasks that look sophisticated, then wobble on a primitive one. That tells you something important. Success on a hard-looking problem does not prove deep competence. It may only prove that the problem resembles something the model has seen often enough, in forms it can compress.
Gary Marcus has been making a version of this criticism for years. You can fix individual failure cases one by one. New ones keep appearing because the underlying robustness problem is still open. That diagnosis can be overstated in some debates, but it remains useful. Patchwork improvement is real. General reliability is another matter.
The market rewards workarounds
Nilsson made an economic observation in 1984 that sounds even sharper now. There is limited commercial appetite for encoding everyday common sense. There is plenty of appetite for encoding expertise that unlocks clear revenue.
That asymmetry explains a lot.
Money flows to code generation, document analysis, customer support tooling, search assistants, research copilots, domain-specific agents, and compliance workflows. These systems matter. Many are genuinely useful. But look at how they become useful. Teams do not solve common sense in the abstract. They narrow the task, constrain the environment, add retrieval, specify the schema, tune the prompt, insert checks, and route uncertain cases to a human.
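The last of those workarounds, routing uncertain cases to a human, can be sketched in a few lines. The names and the threshold are illustrative, and the confidence score is assumed to come from the model or a separate calibrator; nothing here is from a particular product.

```python
# Minimal sketch of the "route uncertain cases to a human" pattern.
# Threshold and field names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ModelOutput:
    label: str
    confidence: float  # assumed to come from the model or a calibrator

def route(output: ModelOutput, threshold: float = 0.85) -> str:
    # Below the threshold, the case leaves the automated path entirely
    # instead of shipping a plausible-sounding guess.
    if output.confidence >= threshold:
        return f"auto:{output.label}"
    return "human_review"

print(route(ModelOutput("refund_request", 0.93)))  # auto:refund_request
print(route(ModelOutput("refund_request", 0.41)))  # human_review
```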
In other words, the market pays for ways around the old problem.
There is nothing irrational about this. If you can build a system that helps lawyers review contracts faster, nobody demands a philosophical breakthrough on how machines internalize everyday causality. They want billable hours reduced by Thursday. Progress arrives as scaffolding, not purity.
That is also why some of the most promising technical ideas feel like partial strategies rather than final answers. Retrieval grounding helps because it anchors outputs in verifiable sources. Neuro-symbolic methods help because explicit structure can correct some statistical drift. Reasoning traces can help when decomposition is the missing piece. All of these improve real systems. None fully resolves the problem McCarthy described: how to represent and use the vague, background knowledge humans rely on constantly without even noticing.
Reliability comes from architecture, not admiration
This historical perspective changed the way I think about building with language models.
Benchmarks matter, but they mostly tell you about the center of the distribution. Production lives near the edges. Users rephrase things badly. They omit crucial details. They combine intents. They bring local slang, partial facts, weird timing, and contradictory evidence. A model that looks polished on a test set can start leaking errors as soon as it meets actual people, which is a very expensive way to learn humility.
The strongest systems I see now are hybrid by design. The model handles interpretation, synthesis, and flexible language. Explicit business rules handle hard constraints. External tools verify facts, perform calculations, or retrieve grounded context. Humans review the cases where cost of failure is high and ambiguity is persistent. This is less elegant than the fantasy of one model doing everything. It is also much more dependable.
That shift matters psychologically as much as technically. Many teams still approach the model as if reliability will emerge once the prompt is polished enough. Sometimes it does improve. Often you are using prompt craft to compensate for a missing systems decision. The answer is not another adjective in the instruction block. It is better context design, narrower scopes, and failure pathways that assume the model will drift.
The best question is no longer “How smart is this model?” It is “Under what conditions does this system remain predictable?” Those are different questions. The first flatters the technology. The second protects the user.
Progress without common sense is still progress
None of this means AI is stalled. That would be absurd. The capability gains are real and in some domains extraordinary. A model that can summarize a dense contract, translate across languages, explain a codebase, and draft a passable memo would have looked miraculous in 1984.
But capability is not the same as robustness, and fluency is not the same as judgment.
The archive matters because it strips away the novelty effect. It reminds us that some of the deepest problems were visible before the current wave had a name, before GPUs became destiny, before product teams learned to say “agentic” with a straight face. Nilsson and McCarthy were not describing a bug in a particular stack. They were describing a stubborn mismatch between human knowledge and machine representation.
That mismatch does not make current systems useless. It makes them legible. Once you see brittleness as a structural property rather than a temporary embarrassment, a lot of design decisions become clearer. You stop asking the model to be a general judge of reality. You give it richer context, narrower authority, and real checks. You stop waiting for one more scale jump to transmute plausibility into understanding.
Forty years later, the surprising part is not that machines still struggle with common sense. The surprising part is how much value we can extract anyway, provided we stop pretending the old problem disappeared.
Sources
- The Computer Chronicles: Artificial Intelligence (1984), with Nils Nilsson, John McCarthy, and Edward Feigenbaum.
- “On the Brittleness of LLMs: A Journey around Set Membership” (arXiv, 2025).
- Gary Marcus, “AI still lacks common sense, 70 years later.”
End of entry.
Published April 2026