AGI Is Not a Threshold: Shane Legg’s Test Changes the Debate
The most misleading image in AI is a finish line.
People keep talking about AGI as if a starter pistol had been fired in secret, and one morning the industry will announce that the machine has crossed a bright red line. Before that day, no AGI. After that day, history splits in two. It is a clean story, and it is probably the wrong one.
In a recent Google DeepMind podcast episode, Shane Legg argues for something much more useful. Stop treating AGI as a binary event. Treat it as a graded capability, then define tests that force the argument out of vibes and into evidence. That shift sounds technical. It is also political, economic, and strangely clarifying.
It clarifies why smart people can look at the same systems and produce wildly different timelines. They are often talking about different thresholds while using the same word. One person means “economically valuable autonomy in white-collar work.” Another means “typical human competence across almost all cognitive tasks.” A third means “Einstein plus Mozart plus better management skills than your boss.” Put all that under one acronym and confusion is guaranteed.
Legg’s framing does not solve every problem. It does something better. It turns a foggy cultural argument into a question of measurement, failure modes, and social preparation.
A human bar, not a heroic one
Legg’s starting point is intentionally modest. His “minimal AGI” is not a machine that beats the best human at everything. It is an artificial agent that can do at least the kinds of cognitive tasks humans can typically do.
That word, “typically,” carries most of the weight.
If the bar is too low, then almost any polished chatbot starts to count. If the bar is too high, the definition becomes absurdly elitist. You end up demanding that AGI compose symphonies, solve frontier physics, and maybe renovate your kitchen without forgetting the backsplash. Humans themselves are not like that. Most people have broad competence with sharp limits. A useful definition should reflect that.
So the reference class is not excellence. It is normal human cognition across a wide range of tasks. The system should not fail in ways that would be surprising coming from an ordinary person. If a typical adult can navigate the task, reason through it, or learn it in context, then a minimally general system should too.
This is more disciplined than the benchmark culture surrounding AI. Benchmarks reward local victories. A model crushes a coding suite, aces a graduate exam, or writes fluent prose in dozens of languages, and people rush to declare a new era. Legg’s definition resists that impulse. Generality means you do not get to cherry-pick the domains where the machine shines and ignore the odd places where it still behaves like a very convincing intern who has never seen a stapler.
That sounds stricter because it is stricter. It also makes the concept less mystical. AGI becomes a claim about coverage and reliability, not about aura.
The uneven shape of current progress
This framing helps explain a fact that often gets lost in public debate: current systems are already beyond humans in some narrow or semi-broad dimensions, while still falling short in places that feel almost embarrassingly basic.
Legg’s answer to the old “sparks of AGI” language is basically that the sparks metaphor is too small. These systems can already do things no human can do in aggregate. They can use huge numbers of languages. They can recall obscure facts from a distribution no person has time to absorb. They can compress gigantic swaths of public knowledge into something like immediate access. On those dimensions, comparing them to a single human is almost silly.
And yet the same systems still stumble in ways people find disorienting. They struggle with continual learning across long periods. They can be brittle with visual-spatial reasoning. Diagrams, perspective, counting relations in a graph, following structure through a messy representation—these are all areas where humans often outperform them without noticing they are doing anything special.
That combination is why the discourse gets weird. A system can look superhuman in broad language use and subhuman in the cognitive equivalent of not understanding how chairs work. People then project their preferred narrative onto the mismatch. Optimists see temporary rough edges. Skeptics see a fundamental ceiling. Legg’s position is more restrained. He does not treat these gaps as proof of impossibility. He treats them as part of the long tail of human cognition that still has to be covered.
That distinction matters. If the problem is a long tail, the path forward is engineering, data, architecture, and testing. If the problem is a deep conceptual wall, the entire roadmap changes. Legg is clearly on the first side of that divide, though not in a simplistic “just scale harder” way.
The roadmap hiding inside the definition
One useful feature of operational definitions is that they quietly reveal a technical agenda.
If minimal AGI means broad, human-typical competence, then raw parameter count is not enough. More data helps, but it has to be the right kind of data. If models remain weak at visual reasoning, embodied interaction, or long-horizon learning, then feeding them more text scraped from the internet will not magically close the gap. You need training signals that actually exercise those faculties.
Legg also points toward architectural change, especially around memory and continual learning. That is important because today’s dominant systems are excellent at compressing prior training into weights, but much less elegant when it comes to incorporating new knowledge over time without retraining or awkward scaffolding. Humans are different. They learn on the job. They update from experience. They store episodes, retrieve them when relevant, and gradually fold repeated patterns into skill.
That suggests a hybrid future. A large underlying model does the general pattern processing. On top of it sit memory systems, retrieval layers, control loops, and mechanisms for storing new experience without melting old capabilities. Over time, some of that experience may be distilled back into the core model. This is not as cinematic as a single giant net waking up. It is more like building a mind out of cooperating subsystems, each compensating for the others’ weaknesses.
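To make that shape concrete, here is a minimal sketch of what such a hybrid loop might look like. Everything in it is an assumption for illustration: the class names (EpisodicMemory, HybridAgent), the keyword-overlap retrieval, and the stub core model stand in for real components, and none of it describes an actual DeepMind system.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    situation: str
    action: str
    outcome: str


class EpisodicMemory:
    """Stores experience; retrieval here is naive keyword overlap."""

    def __init__(self) -> None:
        self.episodes: list[Episode] = []

    def store(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def retrieve(self, situation: str, k: int = 3) -> list[Episode]:
        words = set(situation.lower().split())
        scored = sorted(
            self.episodes,
            key=lambda e: len(words & set(e.situation.lower().split())),
            reverse=True,
        )
        return scored[:k]


class StubCoreModel:
    """Stand-in for a large frozen model; it just echoes its prompt."""

    def generate(self, prompt: str) -> str:
        return f"response informed by: {prompt[-60:]}"


class HybridAgent:
    """Control loop: retrieve relevant episodes, condition the core
    model on them, act, then store the new experience as memory
    without touching the model's weights."""

    def __init__(self, core, memory: EpisodicMemory) -> None:
        self.core = core      # frozen general pattern processor
        self.memory = memory  # grows with experience, no retraining

    def act(self, situation: str) -> str:
        context = self.memory.retrieve(situation)
        prompt = "\n".join(f"past: {e.situation} -> {e.outcome}" for e in context)
        action = self.core.generate(prompt + "\nnow: " + situation)
        self.memory.store(Episode(situation, action, outcome="unobserved"))
        return action
```

The design choice worth noticing is the separation of concerns: the core model's weights never change, while the memory grows with every episode. Distilling accumulated episodes back into the core would be a separate, slower process, which is roughly the division of labor described above.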
The public conversation still treats “bigger model” as the central story because size is easy to see. The harder truth is that reliability across the full spread of human cognition probably depends on system design, not just scale curves. Memory, tool use, verification, planning, and adaptation are not decorations. They are the infrastructure of generality.
Levels matter because timelines depend on them
Once you think in levels, timeline arguments stop sounding quite so incompatible.
Legg distinguishes minimal AGI from fuller forms. Minimal AGI covers the range of tasks a typical human can generally handle. “Full AGI,” as he sketches it, reaches across the full spectrum of human cognitive potential, including the rare peaks. That means not merely functioning like an ordinary adult across domains, but matching the kinds of exceptional outputs that define human genius at the edge.
Beyond that lies superintelligence, which is harder to pin down with precision but easier to motivate. Human brains are not sacred ceiling devices. They are constrained biological hardware running on about 20 watts, with slow signaling and all the usual compromises of evolution. There is no obvious reason to assume that our current cognitive range is the maximum achievable by physical systems. Data centers do not share our energy budget or our skull size.
You do not need to accept every fast-takeoff fantasy to accept that point. A machine can surpass people in general cognitive effectiveness without arriving in some thunderclap. It can happen through widening margins in domain after domain, then through improved integration, reliability, and speed. By the time the label debate settles, the practical consequences may already be everywhere.
This is also why public arguments about “AGI in three years” versus “AGI in twenty years” are often so slippery. The speakers may be anchoring to different levels. Three years for economically transformative agents in software or customer support may coexist with twenty years for robust, human-typical competence across every corner case of cognition. Both predictions can sound precise while referring to different destinations.
A test instead of a slogan
The strongest part of Legg’s proposal is not his vocabulary. It is the test.
Phase one is a battery of cognitive tasks with a human baseline. The machine should perform at the level of a typical human across the suite. The spirit is strict. A single surprising failure on a task ordinary people handle should count against the claim of minimal AGI. This is not because one miss proves the system is useless. It is because generality means the weird misses matter. They reveal holes in coverage.
Phase two is the part that makes the idea serious: adversarial red teaming. Give an expert group time—weeks or months—to probe the system, search for counterexamples, and identify a cognitive task that ordinary humans can do but the model still fails. In other words, do not just let the builder choose the exam. Let motivated skeptics hunt for the crack.
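As a sketch, the two phases reduce to a small evaluation harness. The code below is illustrative only: the Task structure, the human_baseline field, and the modeling of red-team time as a fixed budget of attempts are simplifications assumed for the example, not Legg's actual protocol.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Task:
    name: str
    human_baseline: float           # typical-human score on this task, 0..1
    run: Callable[[object], float]  # returns the model's score, 0..1


def phase_one(model, suite: Iterable[Task]) -> bool:
    """Pass only if the model matches typical-human performance on every
    task in the suite. A single surprising miss counts against generality."""
    return all(task.run(model) >= task.human_baseline for task in suite)


def phase_two(model, red_team: Callable[[object], Optional[Task]],
              attempts: int) -> bool:
    """Give motivated skeptics `attempts` tries to construct a cognitive
    task that ordinary humans handle but the model fails."""
    for _ in range(attempts):
        probe = red_team(model)
        if probe is not None and probe.run(model) < probe.human_baseline:
            return False  # a crack was found; the minimal-AGI claim fails
    return True


def minimal_agi_claim_holds(model, suite, red_team, attempts: int = 1000) -> bool:
    return phase_one(model, suite) and phase_two(model, red_team, attempts)
```

The asymmetry is the point: phase one is a conjunction over a fixed suite, while phase two hands the task distribution itself to an adversary. A system can pass the first by being tuned to the exam; passing the second requires the coverage to be real.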
That changes the conversation in a few ways.
First, it shifts attention away from checklists. A model can cook from recipes, write code, pass law exams, and operate a browser, yet still fail in ways that expose a lack of genuine breadth. A curated checklist is useful for marketing and fundraising. It is much less useful for claims about general cognition.
Second, it acknowledges that intelligent systems fail adversarially. The relevant question is not whether a model performs well on tasks its creators expected. The relevant question is whether smart opponents can find a domain where it collapses while humans stay stable.
Third, it makes AGI legible to outsiders. Regulators, firms, and the public do not need a grand unified theory of intelligence. They need a process for evaluating claims. A transparent task suite plus adversarial probing is imperfect, but it is much closer to an institution than a slogan.
There are practical difficulties, of course. Constructing a representative suite is hard. Defining “typical human” performance is messier than it sounds. Red teams can overfit their own cleverness, and model builders can tune systems to the exam. If inspectors get privileged internal access, governance questions multiply. Still, this is the right kind of difficulty. It is the difficulty of measurement, not the fog of metaphysics.
Ethics gets stranger when capability generalizes
Legg’s comments on ethics are easy to underrate because they sound narrower than the AGI headline. They are not narrow. They point toward the same deeper issue: once systems become broadly capable, safety cannot rely on simple reflexes.
He uses the language of “system two safety,” borrowing from the distinction between fast intuition and slower deliberation. The idea is that safe behavior cannot be reduced to one-line commandments. “Never lie” breaks instantly in edge cases. “Always help” is no better. Real moral reasoning involves conflicts, context, consequences, and tradeoffs between values that do not fit into bumper stickers.
A capable model might therefore need to reason about ethics rather than merely imitate approved responses. In principle, that could make it more consistent than humans. People are famously bad at applying their own values evenly. We rationalize. We bend to incentives. We get tired, tribal, scared, or vain. A system able to evaluate consequences coherently and apply principles at scale could outperform us morally in some settings.
That is the optimistic version. The catch is grounding. Current models do not inhabit the world as humans do. They ingest traces of human experience without being located inside bodies, families, and material constraints in the same way. That makes moral competence both easier to simulate and harder to trust. A beautifully reasoned answer about harm is not the same thing as a robust understanding of harm in practice.
Legg also raises the idea of monitoring reasoning traces, often discussed as chain-of-thought monitoring. The attraction is obvious. If a model explains its reasoning, maybe we can inspect it for dangerous plans or subtle failures. The problem is faithfulness. The visible reasoning may not be the actual driver of behavior. You can learn a lot from traces, but only if they correlate with the mechanism that matters.
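A toy version of such a monitor makes that caveat concrete. The regex patterns and the escalate-or-execute gate below are hypothetical stand-ins; a production monitor would more likely be a trained classifier, and even then it would inherit the same limitation of seeing only the trace, not the mechanism.

```python
import re

# Hypothetical red-flag patterns; a real monitor would use a trained
# classifier rather than a handful of regexes.
FLAG_PATTERNS = [
    r"\bdisable (the )?(monitor|oversight)\b",
    r"\bhide (this|the plan) from\b",
    r"\bthe user must not (know|see)\b",
]


def monitor_trace(trace: str) -> list[str]:
    """Scan a reasoning trace for red-flag phrases. This catches problems
    only to the degree that the trace is faithful to the mechanism that
    actually produced the action."""
    return [p for p in FLAG_PATTERNS if re.search(p, trace, re.IGNORECASE)]


def gate_action(action: str, trace: str) -> tuple[str, object]:
    """Escalate to human review instead of executing when the trace
    trips a flag."""
    flags = monitor_trace(trace)
    if flags:
        return ("escalate", flags)
    return ("execute", action)
```

However the detector is built, the guarantee it gives is conditional: it flags what appears in the trace, so its value rises and falls with the faithfulness of that trace.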
This is one of those places where progress in capability and safety are entangled. A system that reasons more explicitly may be safer in some situations and more capable of strategic deception in others. That is not a paradox. It is what happens when the thing you are building starts to look less like autocomplete and more like an agent navigating incentives.
The real bottleneck is no longer intelligence
The deepest point in Legg’s argument arrives after the definitions.
If cognitive labor becomes abundant, the core economic bargain of modern societies starts to wobble. For a long time, access to resources has been tied to selling labor, including knowledge work. You study, credential, specialize, and trade your cognitive time for income. If systems can perform a widening share of that work cheaply, at speed, and eventually with agency, then abundance in intelligence does not automatically mean abundance in livelihood.
This is why the fixation on the arrival date is too shallow. The key question is not the ceremonial moment when a system earns a label. The key question is what social architecture exists when high-value cognition becomes cheap.
The first pressure will likely hit laptop-only work because software agents can enter those workflows before robotics gets affordable and reliable everywhere. That does not mean plumbers or nurses are magically safe. It means the cost curve for replacing them includes hardware, physical reliability, regulation, and real-world messiness. A browser agent can be deployed by lunch. A robot that can handle old pipes in a cramped apartment without flooding the building is a different engineering problem.
Even so, the pattern is clear. Production can rise while wage distribution weakens. The pie can grow while access to the pie becomes more concentrated. If intelligence is abundant but ownership of the systems remains narrow, then the social effect is not universal leisure. It is bargaining power moving uphill.
That is why Legg’s framing matters far beyond the lab. Once measurement gets sharper, the political problem gets sharper too. Education policy, tax policy, antitrust, labor law, procurement, data governance, benefit systems, and public infrastructure all become part of the same story. If societies wait for perfect consensus on AGI definitions before redesigning institutions, the redesign will happen under panic and private leverage.
The odd comfort of threshold thinking is that it lets everyone postpone this conversation. If AGI is a single dramatic event, then maybe it is still far away, and maybe someone else will ring the bell when it matters. A continuum removes that excuse. The transition can be economically decisive before the terminology stabilizes.
A better question to ask
Legg’s contribution is not that he solved intelligence. It is that he made the debate less mystical and more accountable.
Treating AGI as levels instead of revelation gives people a common frame for disagreement. A strict definition based on typical human cognition prevents the label from collapsing into marketing. A two-phase test with adversarial probing shifts attention from impressive demos to robust generality. And once you accept that intelligence may become abundant by degrees, the center of gravity moves from prediction theater to institutional design.
The practical question is no longer whether one acronym has officially arrived. It is whether the systems already entering the economy are broad enough, cheap enough, and agentic enough to break the old link between valuable cognition and human wages before society has built anything sturdier in its place.
Published April 2026