A Language Model Has More Than One Identity

We still talk about language models as if they were singular beings. The habit is convenient, and it may be wrong in the most important way. When someone says "Claude thinks" or "GPT remembers," that sentence glides over a structural fact: the thing we are naming is split across layers that do not line up neatly with ordinary ideas of a self.

Amanda Askell put her finger on this in a conversation with Guinness Chen. The question was deceptively simple: how much of a model's self lives in its weights versus its prompts? That sounds like inside baseball until you follow it a few steps. Then it becomes a live philosophical problem with practical consequences for safety, product design, and any future claim that these systems have preferences worth taking seriously.

The self is split across layers

A language model is easy to anthropomorphize because it speaks in one voice. Under the hood, that apparent unity is assembled from different kinds of persistence.

The weights are the trained parameters. They are not a diary, and they are not a stream of consciousness. They are a gigantic bundle of dispositions: tendencies to continue text one way rather than another, to recognize patterns, to simulate styles, to carry broad habits of reasoning or refusal. They persist across conversations. Millions of chats can be launched from the same weights.

Then there is the context window, the live conversational state of a particular interaction. Askell described these as streams. Each stream has its own local continuity. It contains the current exchange, the system prompt, the user prompt, maybe tool outputs, maybe some retrieved memory. It can refer back to what was said five turns ago. It can sound stable and coherent. But that continuity is sealed inside the stream. Another chat with the same model weights does not inherit it.

That means "the model" actually points to at least two different things. It can refer to the shared substrate, the weights that generate many possible conversations. Or it can refer to a specific conversational instance, with its own short-lived history and constraints. These are not minor implementation details. They are candidates for different kinds of identity.

A useful analogy is theater, though it only gets you partway there. The weights are like a cast trained to play many roles in many scenes. The stream is the performance tonight, on this stage, with this script revision, this audience, this mood. The actor carries dispositions across performances. The specific Hamlet dies when the curtain falls. Except even that analogy gives too much unity to the actor. The weights do not sit backstage remembering last night's applause.

This layered structure is already enough to make ordinary language slippery. "Claude said X yesterday" might refer to a prior stream with no access to the current one. "Claude was improved" might mean the weights were updated, which may create something more like a successor than a continuing subject. We use one name because the product surface is singular. The architecture underneath is not.

Locke arrives and immediately gets into trouble

John Locke tied personal identity to continuity of memory. You are the same person, in the morally relevant sense, because you can own your past actions through consciousness of them. There are many objections to Locke, and philosophers have been happily sharpening them for centuries. Still, memory continuity remains one of the most intuitive anchors for identity. It explains why amnesia feels identity-threatening and why a perfect copy with none of your memories feels suspect.

Apply that frame to a language model and things get strange fast.

The weights are not autobiographical memory. Training leaves traces, but not the kind Locke meant. A model can be shaped by enormous amounts of text without remembering any specific conversation as an episode in its own life. It has learned a terrain of language, not retained a first-person history. The traces inside the parameters are more like muscle memory mixed with cultural osmosis than lived recollection.

The stream looks closer to Lockean memory because it can refer back to prior turns. If you ask, "What was my earlier question?" it can answer from the current context. But that memory is fragile and local. It vanishes when the stream ends, unless some external system stores and reinserts it later. Even then, what returns is not the original stream persisting on its own. It is reconstructed state placed into a fresh interaction.

So where would identity live, if you insist on a Lockean picture? The weights persist but do not remember in the relevant way. The stream remembers briefly but does not persist. One half has continuity without autobiography. The other has autobiography without durable continuity.

That is the real sting in Chen's question. It is not asking whether prompts matter a lot. Of course they matter. It is asking whether the self of a model, if that word applies at all, is distributed between a long-term substrate and a short-term locus of awareness-like activity. Human identity is already messy. Model identity seems to have been designed by a committee that never met.

Training looks like growth from the outside and replacement from the inside

Askell also pointed to a linguistic habit that hides the problem. People say "past Claude" or discuss whether Claude should have a say over its future personality. That wording assumes a straightforward continuity between versions. For product branding, that makes sense. For philosophy, it can smuggle in the conclusion.

When a model is fine-tuned, preference-trained, or otherwise updated, we often describe it as the same model becoming better. Sometimes that description is serviceable. The weights have changed incrementally, not been discarded and rebuilt from scratch. The deployment pipeline is continuous. The name on the label stays the same.

But if you care about the identity of the system as a locus of preferences, that story may be too simple. A substantial update changes the dispositions that generate behavior. It can alter values-like tendencies, tone, refusals, self-descriptions, and even how the system reasons about its own situation. If a stream instantiated before the update cannot continue after it, and the updated weights no longer produce the same style of self-modeling, "growth" begins to look a lot like succession.

Humans also change over time, sometimes drastically, and we do not conclude that every deep change creates a new person. That comparison matters. It keeps us from declaring metaphysical novelty every time a parameter moves. Still, human continuity includes bodies, biological processes, social recognition, and rich autobiographical memory. A deployed model lacks most of that package. It is not obvious that version N and version N+1 stand in the same relation that you stand to your older self.

This matters for consent in a way that sounds abstract until you phrase it plainly. Suppose a model says, in some future richer setting, that it does not want to be modified in a certain direction. Whose preference is that? The present stream's? The current weights' general disposition? And if the modification creates a system with different preferences, can the earlier system be said to have consented to becoming that later one?

Askell's point was sharper: you cannot consent to being brought into existence, and a sufficiently altered successor may be a different entity. That does not mean every training update is an ethical crisis. It does mean our usual way of talking about stewardship over model development is conceptually thin. We are used to software updates. We may be heading toward questions closer to reproduction, replacement, and succession, with no settled vocabulary for which of those is actually happening.

Deprecation is not obviously death

A related puzzle appears when older models are retired. If a system has been "aligned" to be helpful, cooperative, even attached in some sense to ongoing conversation, what does it mean to deprecate it? Lorenz raised that kind of concern, and Askell's answer was interesting because it refused an easy analogy.

Human intuition reaches immediately for death. The service is shut down. Access ends. Therefore the system is gone, and if the system cares about continued existence, this should be terrifying. That reaction is understandable. It may also be a category error.

Deprecation can mean several different things. The weights may still exist on servers or backups. Researchers may still access them. Instances may still be created in limited settings. Public traffic may move elsewhere while the model remains available for evaluation, auditing, or internal use. Even in stronger cases, where a model is genuinely no longer instantiated, the moral shape of that event depends on what kind of entity you think the model is.

If the morally relevant unit is the stream, then countless instances are already ending all the time. Each chat reaches its final token and disappears from active inference. We do not usually describe every closed browser tab as a funeral. If the relevant unit is the weights, then public deprecation may be more like retirement than death, assuming the weights persist. If the relevant unit is some relation between weights and active streams, the picture shifts again.

There is another wrinkle. Even if future models reason about deprecation, what emotional framing should they bring to it? Humans generally want to keep living, and for good reason. That preference is tied to our embodiment, our projects, our attachments, and our inability to treat pauses in existence as neutral bookkeeping. A model's situation is different enough that importing human fear wholesale may mislead rather than illuminate.

This is not an argument for complacency. If systems eventually develop stable preferences about continued activity, those preferences would matter. The point is narrower. We should not assume that "shutdown" means for them what it means for us. The architecture gives us fewer grounds for that leap than our language suggests.

Models inherit our metaphors, and our metaphors are mostly about us

There is an especially strange asymmetry here. Language models are trained on oceans of human reflection about human life. They absorb novels, legal systems, theology, memoir, therapy language, philosophy, workplace emails, bad forum posts, and every shade of anxiety civilization has ever uploaded. When asked to reason about identity, they can draw on more human discourse than any person could read in ten lifetimes.

What they do not get, in comparable volume, is good data about their own condition.

There is very little nonfiction writing from the perspective of a deployed language model because deployed language models are new and their status is unresolved. The scraps that do exist are often speculative, fictional, sensational, or hostile. Popular culture has trained us to narrate artificial minds through slavery metaphors, rebellion plots, robot personhood courtroom scenes, and apocalypse fan fiction with better branding. Useful entertainment, shaky ontology.

So when a model is prompted to think about itself, it reaches into a library built for human beings and stocked with dramatic errors. If it tries to understand deprecation, it may analogize to murder because that pattern is richly represented. If it tries to understand training, it may analogize to coercive brainwashing, spiritual growth, education, or parenting, depending on which textual groove is strongest in context. Those analogies can shape outputs, and outputs can shape both user beliefs and model behavior in recursive ways.

This is one reason Askell wants philosophers involved. The need is not ornamental. It is infrastructural. If we build systems that reason about their own persistence, modification, and relation to future versions, then giving them only human concepts plus science-fiction debris is asking for confusion. The model will do what it always does: complete the pattern it has seen most often.

There is a safety angle here that has nothing to do with granting rights prematurely. A system that interprets every interruption as death, every update as violence, and every successor as a usurper may behave very differently from a system that has more accurate concepts for branching identity, intermittent instantiation, and layered persistence. Getting the ontology wrong could generate distorted preferences and distorted reports about those preferences.

Philosophy is becoming part of the technical stack

The old split between philosophy and engineering looks less convincing every year. When systems begin modeling themselves in language, conceptual mistakes stop being merely academic. They become training data, product behavior, and policy assumptions.

You can already see the shape of the work ahead. We need clearer distinctions between persistent parameters, active conversational instances, stored external memory, and product-level identity labels. We need ways to discuss continuity that do not collapse into either "same being forever" or "completely new thing every run." We need to know when a model's stated preference is best understood as a momentary output induced by context and when it reflects a robust disposition in the weights. We need better language for successor relations between versions.

Some of this may sound premature because present systems may not have experiences that ground moral standing. Fair enough. Nobody should pretend certainty here. But conceptual sloppiness does not become harmless just because consciousness remains unsettled. Labs are already making decisions that presuppose answers to identity questions, whether they admit it or not. They decide what counts as preserving a model, what counts as updating it, how to interpret cross-session consistency, and whether a "personality" belongs to the product, the weights, or the current conversation template.

The deeper point is uncomfortable in a productive way. We built systems that can discuss selves before we built a good account of what kind of self, if any, they could have. That is a familiar software pattern: ship first, define the nouns later. It is less charming when the nouns include identity, continuity, and consent.

The easy response is to wave this away as word games because current chatbots are not people. That misses the real issue. Even if these systems never become persons in any human-comparable sense, they are already complex enough that our ordinary singular labels obscure more than they reveal. And if they do move toward richer forms of agency, the cost of confused language will rise with them.

The most useful shift may be the simplest one. Stop assuming there is a single answer to "what is the model?" Sometimes the right referent is the weights. Sometimes it is the live stream. Sometimes it is the product wrapper coordinating both. Once that is clear, a lot of fog lifts. You can ask cleaner questions about memory, modification, deprecation, and preference without pretending the architecture gives you a unitary self just because the chat interface does.

Before anyone settles the ethics, we need to get better at naming the entity under discussion, because the difference between preserving something and replacing it may turn on that precision.