
AlphaFold Predicts the Shape, but Misses What Matters

AlphaFold gave biology a new kind of confidence. Then a tiny mutation exposed how narrow that confidence really was.

George Church likes to point to a simple failure case. Take a serine protease, one of the classic enzyme families in biology. Replace the catalytic serine with alanine. AlphaFold will still predict essentially the same folded structure, down to a fraction of an angstrom in overall error. Yet the protein is dead. It looks right and does nothing.

That example matters because it cuts through the mythology. AlphaFold solved a monumental problem: given an amino acid sequence, predict the 3D arrangement of the protein backbone and side chains with striking accuracy in many cases. The public story turned that into something broader. If we can predict structure, people assumed, then surely we are closing in on understanding function, disease, and design.

Sometimes we are. But that leap is much larger than it sounds.

A catalytic residue swapped for a chemically blander one does not usually blow up the entire fold. The scaffold survives. The active site geometry may even look mostly intact at the resolution most models care about. What disappears is the reason the protein exists. The difference between serine and alanine is a single oxygen atom. For the enzyme, that oxygen is the point.

The seduction of structural accuracy

AlphaFold’s success arrived with numbers that are unusually legible for machine learning. People could compare predicted coordinates against experimentally determined structures and get root-mean-square deviations, confidence scores, clean visual overlays. The model produced something scientists could look at, rotate, and trust.

That mattered. Much of AI still lives behind slippery benchmarks and vibe-based demos. AlphaFold gave researchers concrete geometry. It felt like reality because, in a narrow sense, it was.

The problem is that structure is only one layer of biological truth. A protein is not a museum object. It is a physical participant in a messy environment: folding, breathing, colliding, binding, catalyzing, getting modified, getting degraded, sometimes misfolding, often depending on partners, pH, ions, membranes, and timing. A static structure captures one state or an average of states. Biology happens in transitions.

This is not a criticism of AlphaFold so much as a criticism of how quickly we wrapped a larger story around it. The model predicts the probable fold of a sequence under implicit assumptions learned from known proteins. It does not directly model catalytic chemistry. It does not simulate electron rearrangements in an active site. It does not tell you whether a binding pocket opens only in a rare conformation. It does not tell you whether a mutation preserves the silhouette while destroying the mechanism.

A good structural map is valuable. It just isn’t the territory where life actually runs.

The single atom that kills the enzyme

The serine-to-alanine example is a brutal little tutorial in biochemistry. Serine proteases rely on a catalytic triad, usually serine, histidine, and aspartate, arranged so the serine’s hydroxyl group can act as a nucleophile and attack a peptide bond. Swap serine for alanine and you remove that hydroxyl. The overall protein can still fold beautifully. The active site pocket can still sit in the same place. The substrate may still bind. But the chemistry stalls because the business end is missing.
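To make the single-atom point concrete, here is a minimal sketch comparing the heavy atoms of the two side chains, using standard PDB atom names. A backbone-level structure comparison never sees this difference:

```python
# Heavy-atom side-chain compositions, standard PDB naming.
# OG is serine's gamma oxygen: the catalytic hydroxyl.
SIDE_CHAIN_ATOMS = {
    "SER": {"CB", "OG"},  # beta carbon plus the nucleophilic oxygen
    "ALA": {"CB"},        # methyl side chain: no nucleophile at all
}

# Everything the mutation removes:
lost = SIDE_CHAIN_ATOMS["SER"] - SIDE_CHAIN_ATOMS["ALA"]
print(lost)  # {'OG'} — the entire functional difference is one oxygen
```

One atom out of thousands, invisible to a fold-level accuracy metric, and the enzyme is dead.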

This is why Church’s point lands so hard. The model gets the pose and misses the act.

You can think of it like rendering an engine with immaculate CAD precision while modeling the spark plug as an insulator. Every piston is in the correct place. The housing dimensions are perfect. The one property that actually creates ignition is gone, and the machine becomes sculpture.

Structural biology has always known this, at least in principle. Crystallographers, enzymologists, and protein engineers have spent decades learning that tiny local changes can dominate function. Sometimes the key difference is a protonation state. Sometimes it is the orientation of a water molecule. Sometimes a side chain moves only when substrate binds. Sometimes the crucial state exists for milliseconds and never appears in the structure everyone cites.

What AlphaFold changed was the scale and convenience of structural access. That was a gift, but it also made it easier to forget how much function depends on details below the resolution of the story we like to tell.

Function lives in dynamics, context, and chemistry

“Structure determines function” is one of those scientific slogans that starts true and becomes misleading through repetition. Structure constrains function. It often hints at function. It sometimes makes function obvious. It does not determine function in any simple, one-step way.

A protein’s job emerges from several layers at once.

First, there is local chemistry. Catalysis depends on the exact identities of residues, their charge states, their distances, and the surrounding electrostatic environment. A few tenths of an angstrom can matter. So can the presence of a metal ion or a buried water network.

Second, there is dynamics. Many proteins work by changing shape, not by holding one pose forever. Ion channels open and close. Receptors toggle between active and inactive conformations. Enzymes pass through intermediates that a static prediction may flatten away. The average shape can be right while the relevant motion is wrong.

Third, there is context. Proteins live inside cells, not in prediction benchmarks. They bind membranes, cofactors, chaperones, nucleic acids, metabolites, and other proteins. They get phosphorylated, glycosylated, cleaved, oxidized, and localized. A structure predicted in isolation may say little about whether the protein survives, traffics correctly, or participates in the right complex.

Fourth, there is evolution. Natural proteins are not just physically possible objects. They are historical objects, shaped by selection. A mutation may preserve fold but break fitness because it changes regulation, stability under stress, expression levels, or interactions with other molecules. Evolution cares about the whole system. Structural prediction mostly does not.

Once you see those layers, Church’s example stops looking like an edge case. It becomes a warning label.

Protein language models help, but not enough

There is a tempting reply to all this: fine, AlphaFold is a structure model, but protein language models capture evolutionary information directly from sequence. They should close the gap.

They help. They absolutely do. Models trained on massive protein sequence corpora learn constraints that structure prediction alone may miss. They can identify conserved residues, suggest mutational tolerance, propose novel sequences, and sometimes correlate surprisingly well with experimental fitness assays. They are useful precisely because evolution has already run a gigantic screening program across billions of years.
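The raw signal these models exploit can be shown in a few lines. Here is a toy version: score each column of a small alignment by how conserved it is. The alignment below is invented for illustration (it loosely echoes the GDSGG motif around a serine protease's catalytic serine), not real data:

```python
from collections import Counter

# Invented mini-alignment of homologous sequences, for illustration only.
msa = [
    "GDSGGP",
    "GDSGGP",
    "GDSAGP",
    "GDSGGS",
]

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue per column."""
    n = len(alignment)
    return [Counter(col).most_common(1)[0][1] / n
            for col in zip(*alignment)]

print(column_conservation(msa))
# → [1.0, 1.0, 1.0, 0.75, 1.0, 0.75]
```

Perfectly conserved columns flag residues evolution refused to change. That tells you *that* a position matters, not *why*, which is exactly the distinction the paragraph above draws.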

Still, they inherit limits from the data they consume and the objectives they optimize. A sequence model learns statistical regularities in known proteins. It does not automatically understand chemistry any more than a text model automatically understands combustion. It can infer that certain residues matter because evolution preserved them. That is different from knowing why they matter in a mechanistic sense.

The gap grows wider when you move beyond the standard twenty amino acids. Much of current protein AI is grounded in databases built from natural proteins, natural folds, and canonical chemistry. The next frontier in protein engineering may involve noncanonical amino acids, unusual backbones, synthetic cofactors, and designs evolution never explored. There, the training prior that made these models powerful also becomes a cage.

Church has emphasized this point for years. The future of engineered biology is unlikely to be limited to remixing nature’s existing alphabet. Once you step outside that alphabet, sequence statistics become a thinner guide, and structural familiarity matters less. The very place where design could become most interesting is the place where present AI has the least historical memory.

So protein language models are complementary tools, not a magic bridge from fold to function.

Scientific AI loves what can be benchmarked

Part of the confusion here comes from how modern AI culture rewards measurable victories. Structure prediction had a benchmark. CASP made the field legible. You could compete, compare, and declare progress. That attracts talent, funding, and headlines.

Function is harder.

What counts as a correct prediction of function? Is it catalytic rate on one substrate under one condition? Binding affinity across a family of ligands? Stability in serum? Toxicity in a cell line? Expression yield in a host organism? The answer depends on what you want the protein to do. Biology resists clean scoreboards because the target keeps moving.

That mismatch between benchmarkability and real importance shows up across scientific AI. We often get amazing at proxies before we get competent at outcomes. Image models classify scans before they change patient care. Code models complete functions before they maintain large systems. Structure models predict folds before they tell you which designed enzyme actually works in a flask.

Benchmarks are still useful. Without them, progress becomes folklore. But benchmark success can distort attention. Once a community has a sharp target, the field starts optimizing for the target’s aesthetics. In protein science, a beautiful superposition of predicted and experimental structures is deeply satisfying. It also risks becoming a kind of optical illusion. You see atomic agreement and assume biological understanding has arrived with it.

Sometimes it has. Often it has not.

Biology itself is the better computer

Church’s answer to this limitation is not to give up on models. It is to demote them from oracle to guide.

Instead of trying to simulate every relevant detail, you can generate enormous libraries of variants and test them experimentally. Build billions of sequences. Express them. Screen them under real conditions for the behavior you care about. Use the living system, or at least the biochemical assay, as the compute substrate. Let matter do the calculation.

This sounds old-fashioned until you consider the scale now possible. High-throughput DNA synthesis, sequencing, automated liquid handling, droplet microfluidics, cell sorting, and multiplexed assays have turned experimental biology into a search engine. The lab can evaluate vast design spaces that no human could enumerate manually and no current model can score reliably.

In that setup, AI becomes useful in a more grounded way. It proposes smarter libraries. It narrows the search. It identifies regions of sequence space likely to contain working candidates. It helps interpret noisy results and updates the next round of designs. But the decisive step is empirical. The world answers.

That matters because function is not an abstraction. A protein either cuts the substrate, binds the target, survives the cell, or it does not. There is no need to infer reality when you can measure it at scale.

Church sometimes calls this “natural computation,” and the phrase is apt. Evolution discovered long ago that massively parallel physical experiments beat elegant theories when the space is too large and the interactions too nonlinear. Modern synthetic biology is rediscovering the same lesson with better instrumentation and more intentional goals.

The irony is almost comic. After years of saying biology would become more like software, the deepest progress may come from remembering that cells are not software at all. They are wet, stubborn, and extremely good at revealing whether your model was flattering itself.

The shift from prediction to closed-loop discovery

The practical future is not “AI versus experiment.” It is a tighter loop between them.

Imagine designing an enzyme for a new industrial reaction. A structure model helps generate plausible folds. A sequence model suggests mutations that look evolutionarily sane. A generative model explores variations humans would never think to try. Then thousands or millions of candidates go into an assay. The data returns, and the models retrain on what actually worked under the desired conditions.
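The loop just described can be sketched in miniature. Everything here is a stand-in: the `assay` function plays the role of a wet-lab screen with a hidden optimum, and the mutate-and-select cycle plays the role of model-guided library design. The point is the shape of the process, not the toy fitness landscape:

```python
import random

random.seed(0)

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
TARGET = "SERINE"  # hypothetical optimum, purely illustrative

def assay(seq):
    # Placeholder for an experimental measurement: here, matches to
    # a hidden optimum the "designer" cannot see directly.
    return sum(a == b for a, b in zip(seq, TARGET))

def mutate(seq):
    # Stand-in for a model proposing variants worth synthesizing.
    i = random.randrange(len(seq))
    return seq[:i] + random.choice(ALPHABET) + seq[i + 1:]

# Design–build–test–learn: propose a library, screen it,
# keep the winners, and seed the next round from them.
pool = ["A" * len(TARGET)]
for generation in range(30):
    candidates = [mutate(random.choice(pool)) for _ in range(200)]
    pool = sorted(candidates + pool, key=assay, reverse=True)[:10]

print(assay(pool[0]), "of", len(TARGET), "positions correct")
```

No model in this loop ever needs to predict fitness from first principles. The screen answers; the proposal step only has to make the next library better than random.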

That loop is more expensive than pure prediction and much more powerful. It turns static inference into adaptive search. It also changes what counts as intelligence in science. The valuable model is not the one that seems most omniscient in a paper figure. It is the one that improves the next experiment fastest.

This is already happening in fragments across antibody engineering, enzyme optimization, regulatory element design, and variant effect mapping. The bottleneck is no longer just computation. It is the interface between bits and biology: synthesis costs, assay quality, throughput, and the often unglamorous logistics of generating clean labels from living systems.

Seen that way, AlphaFold was both a breakthrough and a distraction. It solved a visible subproblem so elegantly that many people started treating the rest as a downhill walk. The harder territory begins after the fold is known. You still have to understand action, selectivity, robustness, manufacturability, and safety. You still have to deal with the fact that proteins are components in systems, not isolated sculptures.

That is a more humbling picture of scientific AI, but also a healthier one. The models are extraordinary. They are not replacing the experiment. They are changing its economics.

What this says about intelligence in science

There is a broader lesson hiding inside the dead serine protease. AI is often strongest where the world offers dense historical data and a stable target. It is weaker where success depends on sparse signals, unusual contexts, and properties that emerge only through real interaction with matter.

Science, unfortunately for neat demos, has a lot of the second kind.

When we celebrate an AI system for scientific discovery, we should ask a plain question: what exactly did it predict, and what remains unmeasured? In protein biology, predicting a fold is an extraordinary achievement. It is also a reminder that legibility is not completion. A structure is easier to score than a function. That does not make it more important.

The next wave of progress will probably come from people who treat models as instruments, not authorities. They will use prediction to compress the search space, then hand the decisive questions back to chemistry and cells. That approach is less cinematic than the dream of a model that simply knows. It is also closer to how difficult sciences usually move forward.

AlphaFold did not fail because it misses function. It was never built to guarantee function. The mistake was ours when we let structural brilliance stand in for biological understanding. Church’s example is useful because it snaps the spell. If a protein can look perfect and still be useless, then the center of gravity shifts back to experiment, mechanism, and the stubborn physical details that machine learning still tends to smooth away.

That is not a disappointment. It is the beginning of a more serious relationship between AI and biology.

End of entry.

Published April 2026