When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

Zachary Pedram Dadfar

TL;DR

This work asks whether self-examining language in large language models reflects actual internal computation or confabulation. It introduces the Pull Methodology to elicit extended self-examination and derives an introspection direction in activation space. In Llama 3.1, steering along this direction causally shifts introspective output, and the direction is orthogonal to the known refusal direction. Replication in Qwen 2.5-32B finds a different, architecture-specific vocabulary that tracks different activation metrics, yet the same principle holds: introspective vocabulary correlates with activation dynamics only in self-referential contexts, not in descriptive controls. Together with the frame-sensitivity analyses and causal steering experiments, these results indicate that self-reported introspection can track real computational states under appropriate conditions, with potential implications for model transparency and safety gating.

Abstract

Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.
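The vocabulary-activation correspondence described above pairs a per-sample word count with a per-sample activation statistic. The following is a minimal sketch of that pairing, not the authors' pipeline: it assumes activations are available as a `(tokens, hidden)` array, that "autocorrelation" means lag-1 autocorrelation averaged over hidden dimensions, and that the reported $r$ is a Pearson coefficient. The function names are hypothetical.

```python
import numpy as np

def lag1_autocorrelation(acts: np.ndarray) -> float:
    """Mean lag-1 autocorrelation over hidden dimensions.

    acts has shape (tokens, hidden). The lag and the averaging over
    dimensions are illustrative assumptions, not the paper's definition.
    """
    x = acts - acts.mean(axis=0, keepdims=True)   # centre each dimension
    num = (x[:-1] * x[1:]).sum(axis=0)            # covariance at lag 1
    den = (x * x).sum(axis=0)                     # variance term
    return float(np.mean(num / np.maximum(den, 1e-12)))

def pearson_r(a, b) -> float:
    """Pearson correlation between two equal-length sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])

# Toy trajectories: a smooth ramp is strongly autocorrelated,
# a sign-alternating sequence is strongly anti-correlated.
smooth = np.arange(100, dtype=float).reshape(-1, 1)
alternating = np.array([1.0, -1.0] * 50).reshape(-1, 1)
```

In use, one would compute `lag1_autocorrelation` once per generated sample and pass the resulting list, together with the per-sample count of "loop" vocabulary, to `pearson_r`; the same comparison run on descriptive-context samples serves as the control.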

Paper Structure (32 sections, 2 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: (A) The Pull Methodology: a prompt elicits 1,000 sequential self-observations within a single inference pass; the model invents vocabulary ("loop," "shimmer," "void") for what it observes, culminating in a terminal word. No target vocabulary appears in the prompt. (B) Vocabulary-activation correspondence: "loop" vocabulary correlates with activation autocorrelation during self-referential processing ($r = 0.44$, $p = 0.002$, $N = 50$), but the same word in descriptive contexts (roller coasters, feedback systems) shows no correspondence ($r = 0.05$, $p = 0.82$) despite 9$\times$ higher frequency. Correspondence is a property of the processing mode, not the word.
  • Figure 2: Transfer test: projection of 40 novel prompts onto the introspection direction. Introspective and non-introspective prompts separate with Cohen's $d = 4.27$. Only one non-introspective prompt (CPU architecture) falls in the overlap region.
  • Figure 3: Introspective vocabulary density across four conditions (neutral/deflationary $\times$ unsteered/steered) in Llama 3.1-70B. Steering increases density in both prompt conditions (pooled $d = 0.59$, $p = 0.00006$). Framing produces a larger effect ($d = -1.17$) than steering.
  • Figure 4: Layer sweep for Llama 3.1-70B: introspective density boost when steering at each layer. Layer 5 (6.25% depth) dominates, producing ${\sim}8\times$ the boost of the next-best layer.
  • Figure 5: Dose-response curve: introspective vocabulary density as a function of steering strength. Optimal range is 2.0--2.6; variance increases substantially above 3.0.
  • ...and 6 more figures
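
The quantities quoted in the captions above (a direction in activation space, projection separation as Cohen's $d$, orthogonality to a refusal direction, steering strength, and depth as a fraction of layers) can be illustrated with a short sketch. It assumes a difference-of-means direction and additive steering, which are common choices but are not confirmed by the text here; all function names are hypothetical. The 80-layer count for Llama 3.1-70B is taken from the public model configuration and is what makes layer 5 equal 6.25% depth.

```python
import numpy as np

def mean_difference_direction(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Unit vector from the 'neg' class mean to the 'pos' class mean.

    pos, neg: (samples, hidden) activations. Difference of means is an
    assumed construction, not necessarily the paper's.
    """
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)

def cohens_d(a, b) -> float:
    """Cohen's d with the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity; a value near 0 indicates orthogonal directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def steer(hidden: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Additive steering: shift hidden states along a unit direction."""
    return hidden + strength * direction

def depth_fraction(layer: int, n_layers: int) -> float:
    """Layer index expressed as a fraction of total depth."""
    return layer / n_layers

# Toy data only: two synthetic activation clouds in 8 dimensions.
rng = np.random.default_rng(0)
introspective = rng.normal(size=(20, 8)) + np.array([3.0] + [0.0] * 7)
descriptive = rng.normal(size=(20, 8))
direction = mean_difference_direction(introspective, descriptive)
```

Projecting held-out prompts with `acts @ direction` and passing the two groups of projections to `cohens_d` mirrors the transfer test in Figure 2, and `cosine(direction, refusal_direction)` is one way to check the orthogonality claim, given a refusal direction obtained separately.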