Biology Built Attention Long Before Transformers
Attention can feel like a software trick. A clever matrix operation, discovered by engineers, then overextended into a theory of everything. That story gets weaker the moment you look at gene regulation.
Jacob Kimmel once offered a deliberately cringe analogy: transcription factors are the queries, the DNA sequences they bind are the keys, and genes are the values. It sounds like something you would regret posting after midnight. It also points at a deep structural rhyme between modern machine learning and biology.
The claim is not that cells run transformers in wetware. They do not. The claim is narrower and more interesting: when a complex system needs to select the right action from a huge space of possibilities, under tight constraints, it often lands on a pattern that looks a lot like attention. Something emits a selective signal. Something else exposes matchable features. A downstream effect gets routed only where the match is strong enough and the context allows it.
That is not a branding exercise for AI. It is a hint that attention may be less like an invention and more like a recurring answer to a hard problem.
A query-key-value pattern inside the genome
Start with the concrete biology.
A transcription factor is a protein that binds particular DNA motifs and changes gene expression. Some transcription factors activate genes. Some repress them. Many do both, depending on partners and context. They do not control a single target in isolation. One factor can influence hundreds or thousands of loci across the genome.
The rough mapping looks like this:
| Transformers | Biology |
| --- | --- |
| Query | Transcription factor state |
| Key | DNA binding motif plus local context |
| Value | Gene expression outcome |
That first row needs one tweak. A transcription factor is not just a static query. Its effective query is shaped by its concentration, chemical modifications, binding partners, and the accessibility of nearby chromatin. In other words, the cell is not asking, “Does this protein exist?” It is asking, “Given the cell’s current state, where should this regulator land, and what should happen if it does?”
That already feels familiar. In a transformer, a query vector does not matter by itself. Its effect depends on the keys available and the values attached to them. In a cell, a transcription factor does not matter by itself either. It only becomes consequential when it finds compatible motifs in accessible regions and recruits the machinery that shifts transcription.
There is even a rough analogue of attention weighting. In a language model, relevance starts as similarity scores between queries and keys, which are then normalized into weights. In biology the math is not a clean dot product; instead, binding affinity, motif quality, local chromatin state, and cooperative interactions together determine how strongly a site is occupied and how much regulatory effect it carries. Weak site, closed chromatin, wrong cofactors: no output. Strong site, open chromatin, supporting factors nearby: the gene’s probability of expression changes.
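If you want the machine-learning half of that comparison in code, here is a minimal sketch of scaled dot-product attention in NumPy. The biological labels in the comments are illustrative only, and every vector is random; nothing here models real binding affinities or chromatin.

```python
# Minimal scaled dot-product attention. The biological labels are a toy
# framing, not a model: rows of Q stand in for transcription factor states,
# rows of K for motif-plus-context features, rows of V for expression effects.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax into normalized weights
    return weights @ V                               # weighted mix of downstream values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))    # 2 hypothetical regulator states
K = rng.normal(size=(5, 8))    # 5 candidate binding sites
V = rng.normal(size=(5, 3))    # each site's made-up effect on 3 genes
print(attention(Q, K, V).shape)  # (2, 3): each regulator's net routed effect
```

The point of the sketch is the shape of the computation: matching happens in one space, effect is delivered in another, and normalization decides how much each match counts.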
A genome is full of latent keys. Most are ignored most of the time. That selectivity is the entire game. Every cell in your body carries almost the same DNA, yet a neuron and a liver cell behave like residents of different planets. The difference comes largely from which regulators attend to which regions, in which combinations, for how long.
This is why enhancer biology feels so modern when described in computational language. Enhancers are not simple on-off switches. They are integration surfaces. Multiple transcription factors can bind nearby, compete, cooperate, and condition one another’s effects. The gene downstream behaves less like a hardcoded instruction and more like a context-sensitive readout.
The parallel is not perfect, but it is structurally strong enough to be useful.
Evolution likes leverage
Why would biology converge on this kind of architecture?
Because evolution works under severe bandwidth limits. Mutations are rare relative to the size of the search space. Most changes are neutral or harmful. A system that requires thousands of precisely coordinated edits to produce a new phenotype is hard to evolve. A system where small changes can redirect large downstream programs is much more navigable.
Transcription factors provide that leverage.
Instead of encoding each cell type as a separate cassette of effector genes, genomes rely heavily on regulators that can orchestrate broad expression programs. Change the activity of a key regulator, and you do not tweak one output. You tilt an entire landscape.
Developmental biology is full of examples. Hox genes pattern the body plan by regulating many downstream targets. MyoD can push cells toward a muscle program. The Yamanaka factors can help reprogram differentiated cells toward pluripotency. These are not magic incantations. They are high-leverage control points in a regulatory graph.
That leverage is exactly what you would expect from a good search regime. Small edits need to create meaningful phenotypic differences, otherwise selection has very little to work with. A mutation that changes a transcription factor’s binding preference, expression level, or interaction partners can have large effects without rewriting every target gene from scratch.
There is a computational analogy here that goes beyond wordplay. Deep learning also depends on structured leverage. If every meaningful behavior required changing millions of unrelated parameters in uncoordinated ways, gradient descent would crawl. The systems that train well are usually the ones with reusable abstractions, modular interactions, and pathways for small parameter updates to yield coherent changes in behavior.
Biology did not invent gradients in the machine learning sense. Evolution is not backpropagation wearing a lab coat. But both processes reward representations that are easy to adjust locally and consequential globally. Regulatory networks built around transcription factors do exactly that.
There is also an efficiency argument. If every gene had to encode explicit logic for every possible environmental and developmental condition, genomes would become even more unwieldy than they already are. Regulatory hierarchies compress control. A relatively small set of factors can coordinate vast numbers of downstream genes through combinatorial binding.
Combinatorial is the key word. A single transcription factor rarely defines a full cellular identity. It is usually the combination that matters: factor A in the presence of B and C, absent D, with chromatin opened by E, during a particular developmental window. This lets evolution build a huge repertoire from a smaller set of parts, much the way neural networks reuse learned features across many contexts.
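To see how far a small parts list goes, here is a deliberately cartoonish sketch in the same spirit: five made-up binary factors, one hypothetical program that fires for a single combination, and the exponential space of states they can jointly address. None of the names correspond to real genes.

```python
# Toy combinatorial control, not a model of any real locus. Five binary
# "factors" already address 2**5 = 32 distinct states; the hypothetical
# program below fires in exactly one of them.
from itertools import product

FACTORS = ["A", "B", "C", "D", "E"]   # E stands in for a chromatin-opening pioneer factor

def program_fires(state: dict) -> bool:
    # A, B, C present; D absent; chromatin opened by E.
    return state["A"] and state["B"] and state["C"] and not state["D"] and state["E"]

states = [dict(zip(FACTORS, bits)) for bits in product([False, True], repeat=len(FACTORS))]
print(len(states))                            # 32 combinations from 5 parts
print(sum(program_fires(s) for s in states))  # 1: only one combination triggers the program
```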
Selective addressing solves a universal problem
Once you step back, the common problem becomes obvious.
Complex systems are flooded with possible information. Most of it is irrelevant to the decision at hand. The central challenge is not storing everything. It is deciding what to consult, when, and with what consequence.
Language models face this at the level of tokens and representations. A pronoun may depend on a noun ten words back. A code completion may depend on one function definition and ignore fifty nearby lines. Attention gives the model a way to address relevant context without passing every signal through a fixed bottleneck.
Cells face the same problem in a different medium. The genome contains thousands of genes and a vast number of potential regulatory sites. A cell cannot express everything. It must pull the right subsets into play, based on lineage, signals, stress, nutrient state, circadian timing, and a long list of other conditions. Transcription factors, enhancers, and chromatin states form an addressing system over that latent space.
This is why the analogy feels stronger than a cute comparison. Both systems need sparse relevance over large memory. Both systems benefit from distributed representations that can be flexibly recombined. Both systems rely on context to determine whether a match should have downstream force.
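The sparsity point can be made just as concretely. The toy lookup below, with arbitrary sizes and random data, scores a large store of keys, keeps only the top few, and lets everything else contribute nothing, which is the routing behavior the analogy leans on.

```python
# Toy sparse addressing over a large "memory": score everything cheaply,
# keep only the top-k matches, and route effect through those alone.
# Sizes and data are arbitrary; this is an illustration, not a model.
import numpy as np

def sparse_lookup(query, keys, values, k=3):
    scores = keys @ query                      # relevance of each stored item
    top = np.argsort(scores)[-k:]              # indices of the k best matches
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                               # normalize weights over the selected few
    return w @ values[top]                     # the other entries contribute nothing

rng = np.random.default_rng(1)
keys = rng.normal(size=(10_000, 16))           # a large latent store of potential "sites"
values = rng.normal(size=(10_000, 4))
query = rng.normal(size=16)
print(sparse_lookup(query, keys, values).shape)  # (4,): built from 3 of 10,000 entries
```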
Even the failure modes rhyme. In machine learning, spurious attention can send the model toward the wrong evidence. In biology, misregulation of transcription factors can drive disease, including cancer, developmental disorders, and immune dysfunction. When the routing layer goes wrong, the effects propagate.
There is a lesson hidden there. Intelligence, in the broad sense of adaptive information processing, may depend less on raw compute than on good routing. A pile of stored information is inert until a system can selectively bring the right pieces together.
The biology is messier than the diagram
At this point it is worth protecting the analogy from its own success.
Transformers implement attention through clean linear algebra on discrete timesteps. Cells use chemistry, diffusion, stochastic binding, 3D genome folding, and molecular assemblies that make software diagrams look embarrassingly tidy. A transcription factor does not scan the whole genome in one synchronized pass and produce normalized weights over every site. Much of gene regulation is local, path-dependent, and constrained by physical structure.
Chromatin accessibility matters enormously. Many DNA motifs are present in the genome but effectively hidden. A transcription factor may “know” the key only in principle. In practice, it cannot bind unless nucleosomes are shifted, pioneer factors open the region, or other regulatory events happen first. That means the key is not just the sequence. It is the sequence embedded in a changing physical environment.
The values are messy too. In a transformer block, the value is a vector contribution to the next representation. In biology, the downstream effect can be delayed, nonlinear, and indirect. Binding a factor might increase transcription immediately, repress it through a partner, prime a region for later activation, or do almost nothing until another signal arrives.
Yet the analogy survives those differences because it lives at the level of architecture, not mechanism. The important point is not that cells compute softmax. They do not. The important point is that both systems separate matching from effect. They use a relatively compact set of regulators to address a much larger space of potential targets. They make context-sensitive selection the hinge between state and action.
That separation is powerful enough to keep reappearing.
The brain keeps rhyming with the same idea
The convergence story gets even more interesting when you look beyond gene regulation.
Neuroscience has started to find patterns that sound suspiciously familiar to people who spend time around language models. Eddie Chang’s group has reported neurons in human cortex that appear to track position within linguistic sequences, increasing their firing over the course of a phrase and then resetting. It is tempting to call these positional encodings with a pulse. The phrase is a little too neat, but the resemblance is hard to ignore.
Other researchers, including Trenton Bricken and collaborators, have argued that brains may implement attention-like routing principles as part of how they manage relevance and context. That does not mean the cortex is secretly a transformer with bad tooling. It means engineered systems and evolved nervous systems may be circling the same tradeoffs from opposite directions.
One reason this matters is methodological. AI gives us explicit, inspectable systems that solve hard information problems at useful scale. Neuroscience gives us biological systems shaped by long optimization under energy, noise, and embodiment constraints. When similar motifs appear in both places, they deserve extra scrutiny.
The easy mistake is to take every rhyme as proof of identity. We have done this before. People once tried very hard to see the brain as a telephone exchange, then as a digital computer, then as a Bayesian machine, each metaphor illuminating something and distorting plenty. “The brain is a transformer” would be the latest version of that habit, and it would age about as well.
Still, shared design pressures do seem to generate shared structural answers. Sequence tracking shows up because order matters. Selective routing shows up because relevance is sparse. Hierarchical control shows up because local adjustments need global consequences. These are not arbitrary stylistic choices. They are responses to constraints.
Attention looks more like discovery than branding
There is a quiet shift in perspective that comes from taking this seriously.
If transcription factors and modern attention mechanisms solve related problems with related abstractions, then attention stops looking like a clever fad of 2017. It starts to look like a recurring pattern for handling large, context-dependent spaces of possible action. Biology found one version through evolution. Machine learning found another through gradient-based optimization and human tinkering. The details differ wildly. The shape of the solution does not.
That should make AI researchers more curious about biology, though not in the lazy “nature already solved everything” sense. Biology solved many things badly, slowly, and with side effects that would get your pull request rejected on sight. But it also contains design patterns pressure-tested across absurd spans of time. Regulatory indirection, combinatorial control, sparse routing, memory embedded in structure rather than explicit symbols—these are rich ideas.
It should also make biologists a little less dismissive of AI analogies. Some are shallow. Many deserve to be dismissed. But occasionally an imported concept sharpens the actual biology. Calling transcription factors “queries” is reductive if you stop there. It becomes illuminating if it helps you notice that gene regulation is not mainly about storing instructions. It is about selective access to instructions under context.
The broader implication is almost philosophical. Complex systems may not have that many good ways to remain both flexible and controllable. Once the state space gets large enough, some form of attention-like routing may become close to inevitable. You need a way to focus limited causal force on the right subset of possibilities, using representations that can generalize across new contexts.
That is a much bigger claim than “transformers are good at text.” It suggests we are bumping into a principle of organized intelligence, one that was written into cells long before anyone named an embedding dimension.
End of entry.
Published April 2026