Dissociating Direct Access from Inference in AI Introspection

Harvey Lederman, Kyle Mahowald

TL;DR

This work extensively replicates the thought injection detection paradigm of Lindsey et al. (2025) in large open-source models and shows that these models detect injected representations via two separable mechanisms: probability-matching and direct access to internal states.

Abstract

Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study their mechanism of introspection, first extensively replicating the thought injection detection paradigm of Lindsey et al. (2025) in large open-source models. We show that these models detect injected representations via two separable mechanisms: (i) probability-matching (inferring from the perceived anomaly of the prompt) and (ii) direct access to internal states. The direct access mechanism is content-agnostic: models detect that an anomaly occurred but cannot reliably identify its semantic content. The two model classes we study confabulate injected concepts that are high-frequency and concrete (e.g., "apple"); for these models, correct concept guesses typically require significantly more tokens. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.

Paper Structure (50 sections, 15 figures, 2 tables)

Figures (15)

  • Figure 1: Our main experiments. Left, Experiment 1. Top, first person: steered models detect injection, but confabulate identification; both have "apple" as #1 wrong guess. Bottom, third person: steered model (circular robot who observes the preceding conversation) reports whether another model (square-headed robot) has had a thought injected. Right, Experiment 2. Top, first person primed: priming aids identification but not detection. Bottom, third person primed: unsteered control reports detection; steering can reduce detection.
  • Figure 2: First-person (blue solid) vs. third-person (red solid) detection rates by layer across all 821 concepts (coherent responses only). Dashed blue lines show correct identification rates for first-person. Faded purple shows first-person advantage.
  • Figure 3: Logit lens for coherent NO trials (Lindsey's 50 concepts only). Lines show p(yes)/p(no) from injection layer on. Both models show dramatically elevated ratios (colored lines) relative to control (black dashed), with suppression at late layers.
  • Figure 4: Experiment 2: Llama and Qwen in both primed and unprimed conditions for (A) First-person detection rates, (B) First-person correct identification rates, and (C) Third-person primed rates. Controls are not injected, so there are no inter-layer differences. Not pictured: first-person control detection is 0 in all cases (no false positives).
  • Figure 5: Prompt-only vs. continuous steering (strength 6.0, coherent responses only). (A) Detection rates are similar across conditions; steering during generation is not required for detection. (B) Correct concept mentions (verified using string matching) are higher with continuous steering, especially at later layers. Error bars show 95% Wilson CIs.
  • ...and 10 more figures
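
The error bars in Figure 5 are described as 95% Wilson confidence intervals. As a minimal sketch, assuming each detection rate is treated as a simple binomial proportion (a count of detections out of a count of coherent trials), such an interval could be computed as below. The function name `wilson_interval` and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.959964 corresponds to a two-sided 95% interval.
    Hypothetical helper for illustration; not the authors' code.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    # The interval is centred on a value shrunk toward 0.5, not on p_hat itself.
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z2 / (4 * trials * trials)
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative counts only (not results from the paper).
low, high = wilson_interval(successes=5, trials=10)
print(f"rate = 0.50, 95% Wilson CI = [{low:.3f}, {high:.3f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays within [0, 1] and remains informative when the observed rate is 0 or 1, which matters for conditions such as a control with no false positives.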