Dissociating Direct Access from Inference in AI Introspection

Harvey Lederman; Kyle Mahowald

TL;DR

This work extensively replicates the thought injection detection paradigm of Lindsey et al. (2025) in large open-source models, and shows that these models detect injected representations via two separable mechanisms: probability-matching and direct access to internal states.

Abstract

Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study their mechanism of introspection, first extensively replicating the thought injection detection paradigm of Lindsey et al. (2025) in large open-source models. We show that these models detect injected representations via two separable mechanisms: (i) probability-matching (inferring from the perceived anomaly of the prompt) and (ii) direct access to internal states. The direct access mechanism is content-agnostic: models detect that an anomaly occurred but cannot reliably identify its semantic content. Both model classes we study confabulate injected concepts that are high-frequency and concrete (e.g., "apple"); correct concept guesses typically require significantly more tokens. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.

Paper Structure (50 sections, 15 figures, 2 tables)

Introduction
Prior Work
General Methods
Experiment 1: First vs. Third Person
Methods
Results
Replication of Lindsey et al. (2025)
Third-person condition and first-person advantage
Logit analysis reveals a suppression effect.
Models love apples.
Confabulations are lexically more concrete, more positive, and less arousing
Discussion
Experiment 2: Against Modesty Bias
Methods
Results
...and 35 more sections

Figures (15)

Figure 1: Our main experiments. Left, Experiment 1. Top, first person: steered models detect injection, but confabulate identification; both have "apple" as the #1 wrong guess. Bottom, third person: steered model (circular robot, which observes the preceding conversation) reports whether another model (square-headed robot) has had a thought injected. Right, Experiment 2. Top, first person primed: priming aids identification but not detection. Bottom, third person primed: unsteered control reports detection; steering can reduce detection.
Figure 2: First-person (blue solid) vs. third-person (red solid) detection rates by layer across all 821 concepts (coherent responses only). Dashed blue lines show correct identification rates for first-person. Faded purple shows first-person advantage.
Figure 3: Logit lens for coherent NO trials (Lindsey's 50 concepts only). Lines show p(yes)/p(no) from injection layer on. Both models show dramatically elevated ratios (colored lines) relative to control (black dashed), with suppression at late layers.
Figure 4: Experiment 2: Llama and Qwen in both primed and unprimed conditions for (A) First-person detection rates, (B) First-person correct identification rates, and (C) Third-person primed rates. Controls are not injected, so there are no inter-layer differences. Not pictured: first-person control detection is 0 in all cases (no false positives).
Figure 5: Prompt-only vs. continuous steering (strength 6.0, coherent responses only). (A) Detection rates are similar across conditions; steering during generation is not required for detection. (B) Correct concept mentions (verified using string matching) are higher with continuous steering, especially at later layers. Error bars show 95% Wilson CIs.
...and 10 more figures
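
The Figure 5 caption reports 95% Wilson confidence intervals on detection rates. As a minimal sketch of how such an interval can be computed for a binomial proportion, the snippet below uses the standard Wilson score formula. The function name `wilson_interval` and the example counts are hypothetical illustrations and are not taken from the paper or its code.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion (hypothetical helper).

    Returns (lower, upper) bounds, clamped to [0, 1].
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)

    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z * sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials))
    ) / denom

    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative counts only (assumed, not results from the paper):
# 50 detections out of 100 coherent responses.
low, high = wilson_interval(50, 100)
print(f"detection rate 0.50, 95% Wilson CI: [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal-approximation interval, the Wilson interval keeps a nonzero width when the observed rate is 0 or 1, which matters for conditions such as a control with no false positives.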
