Latent Introspection: Models Can Detect Prior Concept Injections

Theia Pearson-Vogel, Martin Vanek, Raymond Douglas, Jan Kulveit

Abstract

We uncover a latent capacity for introspection in a Qwen 32B model: the model can detect when concepts have been injected into its earlier context and identify which concept was injected. Although the model denies injection in sampled outputs, logit lens analysis reveals clear detection signals in the residual stream, which are attenuated in the final layers. Prompting the model with accurate information about AI introspection mechanisms strengthens this effect dramatically: sensitivity to injection rises from 0.3% to 39.2%, with only a 0.6% increase in false positives. In addition, mutual information between nine injected and recovered concepts rises from 0.62 bits to 1.05 bits, ruling out generic noise explanations. Our results demonstrate that models can have a surprising capacity for introspection and steering awareness that is easy to overlook, with consequences for latent reasoning and safety.

Paper Structure (67 sections, 3 equations, 26 figures, 1 table)

Figures (26)

  • Figure 1: Concept injection on earlier text shifts outputs. We train a steering vector by contrasting activations between prompts, and apply it only during KV-cache generation for an initial turn. We then remove steering, and query the model about whether a concept was injected. P("yes") increases from ${\approx}0.8\%$ to ${\approx}39.2\%$.
  • Figure 2: Concept injection shifts introspection responses. Bars show P("yes") when the model is asked whether a concept was injected, comparing no-injection baseline (light) to injection (dark) across four info document conditions. Error bars show $\pm$1 SD.
  • Figure 3: Injection effects are specific to introspection. Bars show the change in P("yes") caused by injection for introspection questions and four control categories across info documents. Factual controls (always-yes, always-no) show near-zero shift; ambiguous controls show intermediate effects.
  • Figure 4: We can recover most concepts. Rows show injected concepts; columns show model predictions at layer 62. Diagonal values indicate correct identification. MI = 1.35 bits. Prompt setting: Poetic_No_Mechanism + Poetic_Document.
  • Figure 5: Left: Introspection signals emerge in middle layers and attenuate before output. Lines show P("yes") at each layer under injection (dark) and no-injection baseline (light) for Accurate_Mechanism framing across info document conditions. Signals peak near layer 60, then drop sharply in the final layers. Shaded regions show $\pm$1 SD. Right: Concept identification signals emerge in middle layers and attenuate before output. Lines show mutual information between injected and predicted concepts at each layer for Accurate_Mechanism framing across info document conditions. MI peaks near layer 62, then drops in the final layers. Maximum possible MI is 3.17 bits (9 equiprobable concepts). Shaded regions show $\pm$1 SD (negligible size).
  • ...and 21 more figures
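
The mutual information values quoted in the Figure 4 and Figure 5 captions can be read against the 3.17-bit ceiling with a small calculation. The sketch below is illustrative only and is not the authors' evaluation code: it assumes MI is computed from a table of counts whose rows are injected concepts and whose columns are predicted concepts. The function name `mutual_information_bits` and the toy tables are hypothetical.

```python
import math


def mutual_information_bits(counts):
    """Mutual information in bits between the row and column variables.

    `counts` is a square or rectangular table of non-negative counts
    (assumed layout: rows = injected concept, columns = predicted concept).
    """
    total = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]

    mi = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c > 0:  # zero cells contribute nothing to the sum
                p_ij = c / total
                # p_ij / (p_i * p_j) simplifies to c * total / (row_i * col_j)
                mi += p_ij * math.log2(c * total / (row_totals[i] * col_totals[j]))
    return mi


n_concepts = 9

# Toy case 1: every injected concept is identified perfectly.
perfect = [[1 if i == j else 0 for j in range(n_concepts)] for i in range(n_concepts)]

# Toy case 2: predictions carry no information about the injected concept.
uninformative = [[1] * n_concepts for _ in range(n_concepts)]

print(round(mutual_information_bits(perfect), 2))  # 3.17, i.e. log2(9)
print(mutual_information_bits(uninformative))      # 0.0
```

With nine equiprobable concepts, perfect identification gives $\log_2 9 \approx 3.17$ bits, so reported values such as 1.05 or 1.35 bits sit between chance and perfect recovery.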