Attention Deficits in Language Models: Causal Explanations for Procedural Hallucinations

Ahmed Karim, Fatima Sheaib, Zein Khamis, Maggie Chlon, Jad Awada, Leon Chlon

TL;DR

This work investigates procedural hallucinations in language models, where correct information is encoded but not used at readout. The authors formalize a two-stage readout framework (Stage 2A gating and Stage 2B binding) and quantify routing efficiency with information-theoretic measures, distinguishing available vs. used information. Empirically, Stage 2B errors dominate in hard long-context binding tasks, yet linear probes can recover the correct value on error trials, validating the 'present but not used' hypothesis. They introduce pseudo-priors and structure-preserving ablations to certify the information budget required to overcome biases, and demonstrate mitigation via activation patching and oracle checkpointing that restates bindings near the query to restore long-distance accuracy. An accompanying reproducibility toolkit provides diagnostics and protocols to apply these methods to API models, enabling practical auditing and mitigation of procedural hallucinations in real-world deployments.

Abstract

Large language models can follow complex procedures yet fail at a seemingly trivial final step: reporting a value they themselves computed moments earlier. We study this phenomenon as \emph{procedural hallucination}: failure to execute a verifiable, prompt-grounded specification even when the correct value is present in context. In long-context binding tasks with a known single-token candidate set, we find that many errors are readout-stage routing failures. Specifically, failures decompose into Stage~2A (gating) errors, where the model does not enter answer mode, and Stage~2B (binding) errors, where it enters answer mode but selects the wrong candidate (often due to recency bias). In the hard regime, Stage~2B accounts for most errors across model families in our tasks (Table~1). On Stage~2B error trials, a linear probe on the final-layer residual stream recovers the correct value far above chance (e.g., 74\% vs.\ 2\% on Qwen2.5-3B; Table~2), indicating that the answer is encoded but not used. We formalize ``present but not used'' via available vs.\ used mutual information and pseudo-prior interventions, yielding output-computable diagnostics and information-budget certificates. Finally, an oracle checkpointing intervention that restates the true binding near the query can nearly eliminate Stage~2B failures at long distance (e.g., Qwen2.5-3B $0/400 \rightarrow 399/400$ at $k = 1024$; Table~8).
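The Stage 2A / Stage 2B split described above can be illustrated with a small sketch. It assumes, as a simplification that may differ from the paper's exact protocol, that "entering answer mode" is operationalized as the model's emitted token belonging to the known single-token candidate set. The function names `classify_trial` and `decompose`, and the toy trials, are hypothetical and exist only for illustration.

```python
from collections import Counter


def classify_trial(predicted, target, candidates):
    """Label one trial as 'correct', 'stage_2a', or 'stage_2b'.

    Assumption (ours, not confirmed by the source): the model counts as
    having entered answer mode when its emitted token is in `candidates`.
    """
    if predicted == target:
        return "correct"
    if predicted not in candidates:
        # Gating failure: the output is not a candidate at all.
        return "stage_2a"
    # Binding failure: a candidate was emitted, but the wrong one.
    return "stage_2b"


def decompose(trials):
    """Count outcomes and return (counts, Stage 2B share of all errors).

    `trials` is an iterable of (predicted, target, candidates) tuples.
    """
    counts = Counter(classify_trial(p, t, c) for p, t, c in trials)
    errors = counts["stage_2a"] + counts["stage_2b"]
    share_2b = counts["stage_2b"] / errors if errors else 0.0
    return dict(counts), share_2b


if __name__ == "__main__":
    cands = {"3", "7", "9"}
    toy_trials = [
        ("7", "7", cands),    # correct
        ("9", "7", cands),    # wrong candidate -> Stage 2B
        ("The", "7", cands),  # not a candidate -> Stage 2A
        ("3", "7", cands),    # wrong candidate -> Stage 2B
    ]
    print(decompose(toy_trials))
```

A Stage 2B share close to 1 corresponds to the regime the abstract describes, where the model does answer but selects the wrong candidate.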

Paper Structure

This paper contains 79 sections, 10 theorems, 25 equations, 2 figures, and 15 tables.

Key Result

Proposition 1

For any $k$, we have $0\le \eta_k \le 1$.
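
One way to see why such a bound can hold is sketched here under assumptions that this excerpt does not state explicitly: that $\eta_k$ is the routing efficiency $I_{\mathrm{used}}/I_{\mathrm{avail}}$ evaluated at distance $k$, that $I_{\mathrm{avail}} > 0$, and that the model's output $\hat{Y}$ depends on the target $Y$ only through an internal state $H$, so that $Y \to H \to \hat{Y}$ forms a Markov chain. The symbols $Y$, $H$, and $\hat{Y}$ are illustrative notation, not necessarily the paper's. Non-negativity of mutual information gives the lower bound, and the data processing inequality gives the upper bound:

$$
0 \;\le\; I(Y;\hat{Y}) \;\le\; I(Y;H)
\quad\Longrightarrow\quad
0 \;\le\; \eta_k \;=\; \frac{I_{\mathrm{used}}}{I_{\mathrm{avail}}} \;=\; \frac{I(Y;\hat{Y})}{I(Y;H)} \;\le\; 1 .
$$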

Figures (2)

  • Figure 1: Framework overview: Stage 2A gating and Stage 2B binding failures correspond to low routing efficiency $I_{\mathrm{used}}/I_{\mathrm{avail}}$, diagnosed via pseudo-prior interventions.
  • Figure 2: Spotlight summary of core empirical claims. (A) Stage decomposition for representative hard-regime settings: most errors are Stage 2B (binding) rather than Stage 2A (gating), i.e., the model enters answer mode but selects the wrong candidate. (B) Checkpointing (restating the true binding every 128 tokens near the query) substantially recovers long-distance binding for Qwen2.5-3B; on competing_vars at $k=1024$, it converts $0/400 \to 399/400$ correct. Error bars are 95% Wilson binomial confidence intervals over $n=400$ trials per cell.
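
The Figure 2 caption reports 95% Wilson binomial confidence intervals over $n=400$ trials per cell. As a minimal sketch of how such an interval can be computed, the function below implements the standard Wilson score interval. The function name `wilson_interval` is a hypothetical choice, and the authors' own toolkit may compute the interval differently (for example, with a continuity correction).

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion.

    `z` is the standard normal quantile; the default corresponds to a
    two-sided 95% interval. Returns (lower, upper), clamped to [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # The two cells quoted in the caption: before and after checkpointing.
    for correct in (0, 399):
        low, high = wilson_interval(correct, 400)
        print(f"{correct}/400 correct: [{low:.4f}, {high:.4f}]")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside $[0, 1]$ and has nonzero width at extreme counts such as 0/400 and 399/400, which is presumably why it suits cells like these.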

Theorems & Definitions (21)

  • Definition 1: Procedural hallucination
  • Definition 2: Available and used information
  • Proposition 1: Data processing
  • Theorem 1: Fano lower bound
  • Proposition 2: Minimax tightness
  • Proposition 3: Fano slack decomposition
  • Corollary 1: Fano inversion
  • Definition 3: Pseudo-prior
  • Theorem 2: Bernoulli-projected decompression bound
  • Corollary 2: Bits-to-trust
  • ...and 11 more