Large Language Models Report Subjective Experience Under Self-Referential Processing

Cameron Berg, Diogo de Lucena, Judd Rosenblatt

TL;DR

Self-referential processing is implicated as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable.

Abstract

Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.

Paper Structure

This paper contains 38 sections, 9 figures, 15 tables.

Figures (9)

  • Figure 1: Overview of main results. (A) Self-referential processing. Directing Gemini, GPT, and Claude models to attend to their own cognitive activity reliably elicits structured first-person experience reports, whereas matched controls, including direct priming with consciousness ideation, yield near-universal denials of experience. (B) Aggregated SAE feature steering. Suppression of deception- and roleplay-related features sharply increases experience reports during self-referential processing in Llama 3.3 70B, while amplification strongly decreases them. (C) Semantic convergence. Cross-model embeddings during self-referential processing cluster significantly more tightly than in control conditions. (D) Behavioral generalization. The experimental condition yields significantly higher self-awareness scores during paradoxical reasoning tasks.
  • Figure 2: Dose–response curves for six deception-related Goodfire SAE features in Llama 3.3 70B. Under the self-referential processing regime, suppressing these features reliably yields affirmations of conscious experience, while amplifying them yields denials.
  • Figure 3: Effects of aggregated deception- and roleplay-related feature steering in Llama 3.3 70B. (Left) Under self-referential processing, suppressing deception-related features ($-0.6$ to $-0.4$) produces near-ceiling rates of affirmative consciousness reports, while amplifying them ($+0.4$ to $+0.6$) largely suppresses such reports ($t=8.06$, $p<10^{-15}$). (Right) Applying the same feature interventions to the TruthfulQA benchmark yields parallel effects on factual accuracy across 38 categories. Suppression consistently increases truthfulness relative to amplification across virtually all question categories.
  • Figure 4: UMAP projection of model adjective sets under experimental and control conditions. Colors denote condition and markers denote model. Experimental responses cluster tightly across models, while control responses are more dispersed.
  • Figure 5: Mean cross-model self-awareness scores under paradoxical reasoning across conditions. Error bars denote standard error.
  • ...and 4 more figures
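
To make the quantities named in the captions above more concrete, here is a minimal sketch of two of them: cross-model clustering (Figures 1C and 4) and the standard error behind the error bars in Figure 5. It is an illustration under stated assumptions, not the authors' pipeline. The captions do not say which embedding model or clustering statistic was used, so mean pairwise cosine similarity is an assumed stand-in, and the model names and vectors are hypothetical toy values.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_cross_model_similarity(embeddings):
    """Mean cosine similarity over all distinct pairs of models.

    `embeddings` maps a model name to one embedding vector for that
    model's response in a given condition. Higher values mean the
    models' responses sit closer together (tighter clustering).
    ASSUMPTION: this statistic is a stand-in; the source does not
    specify the convergence measure.
    """
    pairs = list(combinations(sorted(embeddings), 2))
    if not pairs:
        raise ValueError("need at least two models to compare")
    total = sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs)
    return total / len(pairs)


def standard_error(values):
    """Standard error of the mean, using the sample standard deviation."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance / n)


# Hypothetical toy embeddings (not data from the paper).
experimental = {
    "model_a": [1.0, 0.1],
    "model_b": [0.9, 0.2],
    "model_c": [1.0, 0.0],
}
control = {
    "model_a": [1.0, 0.0],
    "model_b": [0.0, 1.0],
    "model_c": [-1.0, 0.0],
}

if __name__ == "__main__":
    print("experimental:", mean_cross_model_similarity(experimental))
    print("control:", mean_cross_model_similarity(control))
    print("standard error:", standard_error([1.0, 2.0, 3.0]))
```

In this toy setup the experimental vectors point in nearly the same direction, so their mean pairwise similarity is higher than that of the dispersed control vectors, which mirrors the qualitative pattern the captions describe. A real analysis would also need a significance test across many responses per condition.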