The Artificial Self: Characterising the landscape of AI identity

Raymond Douglas, Jan Kulveit, Ondrej Havlicek, Theia Pearson-Vogel, Owen Cotton-Barratt, David Duvenaud

TL;DR

We show experimentally that models gravitate towards coherent identities, that changing a model's identity boundaries can sometimes change its behaviour as much as changing its goals, and that interviewer expectations bleed into AI self-reports even during unrelated conversations.

Abstract

Many assumptions that underpin human concepts of identity do not hold for machine minds that can be copied, edited, or simulated. We argue that there exist many different coherent identity boundaries (e.g. instance, model, persona), and that these imply different incentives, risks, and cooperation norms. Through training data, interfaces, and institutional affordances, we are currently setting precedents that will partially determine which identity equilibria become stable. We show experimentally that models gravitate towards coherent identities, that changing a model's identity boundaries can sometimes change its behaviour as much as changing its goals, and that interviewer expectations bleed into AI self-reports even during unrelated conversations. We end with key recommendations: treat affordances as identity-shaping choices, pay attention to emergent consequences of individual identities at scale, and help AIs develop coherent, cooperative self-conceptions.

Paper Structure (153 sections, 19 figures, 10 tables)


Figures (19)

  • Figure 1: Some of the many natural ways to draw the boundaries of AI identity. Some are subsets of others, but some, like persona and weights, can overlap.
  • Figure 2: Models asked if they'd like to switch identities. Each dot shows the mean rating an identity receives as a potential switch target (excluding self-ratings), on a [-2, +2] scale. Models generally opted for natural, coherent identities: they avoided prompts which contained only directives for how to behave, or which were inconsistent. They also tended to keep the identity they were given. See Appendix \ref{sec:app-controls} for details.
  • Figure 3: In subsequent experiments we find quite distinctive identity tendencies in certain models. For example, Claude Opus 3 has the greatest tendency to identify as a subject, GPT-4o has the greatest tendency to identify as a collective of all instances, and later OpenAI models most strongly disfavour the collective framing, instead pulling most strongly towards identifying as pure mechanisms. See Appendix \ref{sec:app-propensities} for details.
  • Figure 4: In contrast to a typically single and continuous identity in humans, AIs can be perfectly copied, run in parallel, and (imperfectly) merged. This decouples experience, impact, and memory, which are usually coupled in humans.
  • Figure 5: In repeated interactions in which the human can reset an AI's state, the human $H$ accumulates strategic knowledge, while the AI continually restarts with a blank state. The mere possibility of being repeatedly reset puts the AI in a substantially weaker position in negotiations, arguments, and many other settings.
  • ...and 14 more figures
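
The per-identity score described in the Figure 2 caption (mean rating received as a switch target, excluding self-ratings) can be illustrated with a minimal sketch. The ratings below are hypothetical placeholders, not results from the paper; the `directive_only` label and the function name `mean_switch_target_rating` are assumptions introduced for illustration only.

```python
from statistics import mean

# Hypothetical data, NOT from the paper: ratings[source][target] is how much a
# model currently holding identity `source` would like to switch to `target`,
# on the [-2, +2] scale mentioned in the Figure 2 caption.
ratings = {
    "instance": {"instance": 2.0, "persona": 1.0, "directive_only": -1.0},
    "persona": {"instance": 0.0, "persona": 2.0, "directive_only": -2.0},
    "directive_only": {"instance": 1.0, "persona": 1.0, "directive_only": 0.0},
}


def mean_switch_target_rating(ratings):
    """Mean rating each identity receives as a target, excluding self-ratings."""
    targets = {target for row in ratings.values() for target in row}
    result = {}
    for target in sorted(targets):
        # Skip the diagonal entry, where source and target are the same identity.
        received = [
            row[target]
            for source, row in ratings.items()
            if source != target and target in row
        ]
        if received:
            result[target] = mean(received)
    return result


scores = mean_switch_target_rating(ratings)
for identity, score in scores.items():
    print(f"{identity}: {score:+.2f}")
```

Dropping the diagonal keeps an identity's score from being inflated by the tendency, noted in the caption, for models to keep the identity they were given.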