Shadows in the Attention: Contextual Perturbation and Representation Drift in the Dynamics of Hallucination in LLMs
Zeyu Wei, Shuo Wang, Xiaohui Rong, Xuemin Liu, He Li
TL;DR
This work links the incidence of hallucinations in large language models to internal-state drift driven by incremental context perturbations. It introduces an integrated overt–covert pipeline that pairs a tri-path hallucination detector with continuous tracking of hidden states and attention across six open-source models, using TruthfulQA in a two-track titration (Relevant vs Irrelevant) to induce and observe errors. Hallucination rates rise monotonically and then plateau as the internal-drift metrics converge. An attention-locking threshold emerges beyond which hallucinations solidify and resist correction, and a seesaw between semantic assimilation and attention diffusion is modulated by model size. The results provide empirical precursors for intrinsic hallucination prediction and context-aware mitigation, supporting cross-model generalization and informing future robustness safeguards at the architectural and training levels.
Abstract
Hallucinations -- plausible yet erroneous outputs -- remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5--7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence "self-consistent" hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift ($\sim0.69$) and Spearman-Drift ($\sim0$) marks an "attention-locking" threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.
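The four covert drift measures named above can be illustrated with a minimal sketch. The abstract does not give their exact definitions, so everything here is an assumption: the function names are hypothetical, cosine drift is taken as one minus cosine similarity between a reference and a current hidden-state vector, entropy is Shannon entropy of an attention distribution, JS-Drift is the Jensen-Shannon divergence in nats, and Spearman-Drift is the rank correlation between two attention distributions. Under the JS assumption, the reported value of $\sim0.69$ sits close to $\ln 2 \approx 0.693$, the upper bound of the divergence, which is reached when two distributions share no support.

```python
import math

def cosine_drift(h_ref, h_cur):
    """Assumed definition: 1 - cosine similarity of two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(h_ref, h_cur))
    norm = math.sqrt(sum(a * a for a in h_ref)) * math.sqrt(sum(b * b for b in h_cur))
    return 1.0 - dot / norm

def entropy(p):
    """Shannon entropy (nats) of a probability vector, e.g. one attention row."""
    return -sum(x * math.log(x) for x in p if x > 0)

def _kl(p, q):
    # KL(p || q) in nats; terms with p_i == 0 contribute nothing.
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats; bounded above by ln 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def _ranks(values):
    # 1-based ranks, with tied values sharing their average rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(p, q):
    """Spearman rank correlation (Pearson correlation of the ranks).

    Raises ZeroDivisionError if either input is constant.
    """
    rp, rq = _ranks(p), _ranks(q)
    n = len(rp)
    mp, mq = sum(rp) / n, sum(rq) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(rp, rq))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_q = sum((b - mq) ** 2 for b in rq)
    return cov / math.sqrt(var_p * var_q)

if __name__ == "__main__":
    # Toy attention rows over four tokens: round 0 versus a later round.
    attn_round0 = [0.70, 0.20, 0.05, 0.05]
    attn_later = [0.05, 0.10, 0.25, 0.60]
    print("JS divergence:", js_divergence(attn_round0, attn_later))
    print("Entropy change:", entropy(attn_later) - entropy(attn_round0))
    print("Spearman:", spearman(attn_round0, attn_later))
    print("Cosine drift:", cosine_drift([1.0, 0.0, 2.0], [0.5, 1.0, 1.5]))
```

In this reading, an attention map that has moved almost entirely onto different tokens pushes the JS divergence toward $\ln 2$, while a Spearman value near zero means the token ranking in the later round carries no monotonic relation to the original one.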
