When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs
Shaowen Wang, Yiqi Dong, Ruinian Chang, Tansheng Zhu, Yuebo Sun, Kaifeng Lyu, Jian Li
TL;DR
This work investigates spurious correlations as an underexplored driver of hallucinations in large language models, showing that surface-level statistical shortcuts can yield high-confidence outputs that conflict with the ground truth and evade standard detectors. The authors construct a controllable synthetic framework that ties surnames to attributes with correlation strength $\rho$ and demonstrate that detection methods deteriorate as $\rho$ increases; the degradation persists across model scales and even after refusal fine-tuning. They validate the phenomenon on real-world LLMs (including GPT-5) using entity co-occurrence proxies and SimpleQA, where higher co-occurrence is associated with more confident but incorrect answers and reduced detectability. A theoretical kernel-based model then explains why confidence-based detection fails under strong correlations, highlighting a tension between memorization and generalization and underscoring the need for approaches that address spurious correlations throughout the model development lifecycle. Overall, the paper argues that mitigating hallucinations requires strategies beyond traditional uncertainty and internal-probe methods, with a focus on identifying and mitigating shortcut-like correlations in training and fine-tuning data.
Abstract
Despite substantial advances, large language models (LLMs) continue to exhibit hallucinations, generating plausible yet incorrect responses. In this paper, we highlight a critical yet previously underexplored class of hallucinations driven by spurious correlations -- superficial but statistically prominent associations between features (e.g., surnames) and attributes (e.g., nationality) present in the training data. We demonstrate that these spurious correlations induce hallucinations that are confidently generated, immune to model scaling, evade current detection methods, and persist even after refusal fine-tuning. Through systematically controlled synthetic experiments and empirical evaluations on state-of-the-art open-source and proprietary LLMs (including GPT-5), we show that existing hallucination detection methods, such as confidence-based filtering and inner-state probing, fundamentally fail in the presence of spurious correlations. Our theoretical analysis further elucidates why these statistical biases intrinsically undermine confidence-based detection techniques. Our findings thus emphasize the urgent need for new approaches explicitly designed to address hallucinations caused by spurious correlations.
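To make the controlled synthetic setting more concrete, the following is a minimal sketch of how a surname-to-attribute correlation of adjustable strength might be generated. It is an illustration only, not the authors' data pipeline: the function names, the mixture rule (with probability `rho` a person receives the surname's "preferred" attribute, otherwise a uniformly random one), and the example surnames and nationalities are all assumptions introduced here.

```python
import random


def make_synthetic_people(n, rho, surnames, attributes, seed=0):
    """Hypothetical generator of (surname, attribute) pairs.

    Assumption: each surname has one 'preferred' attribute. With probability
    rho a person gets that preferred attribute; otherwise the attribute is
    drawn uniformly from all attributes (which may still match by chance).
    """
    rng = random.Random(seed)
    # Assign a preferred attribute to every surname (round-robin, arbitrary).
    preferred = {s: attributes[i % len(attributes)] for i, s in enumerate(surnames)}
    people = []
    for _ in range(n):
        surname = rng.choice(surnames)
        if rng.random() < rho:
            attribute = preferred[surname]
        else:
            attribute = rng.choice(attributes)
        people.append((surname, attribute))
    return people, preferred


def shortcut_accuracy(people, preferred):
    """Accuracy of a predictor that only uses the surname shortcut."""
    hits = sum(1 for surname, attribute in people if preferred[surname] == attribute)
    return hits / len(people)


if __name__ == "__main__":
    surnames = ["Tanaka", "Rossi", "Müller", "Silva"]          # illustrative only
    attributes = ["Japan", "Italy", "Germany", "Brazil"]       # illustrative only
    for rho in (0.0, 0.5, 0.9):
        people, preferred = make_synthetic_people(5000, rho, surnames, attributes)
        print(rho, round(shortcut_accuracy(people, preferred), 3))
```

Under this assumed mixture rule, the surname shortcut is expected to be right for a fraction `rho + (1 - rho) / k` of individuals, where `k` is the number of attributes. The remaining individuals are the cases where a model relying on the shortcut would answer confidently but incorrectly, which is the failure mode the abstract describes.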
