When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs

Shaowen Wang, Yiqi Dong, Ruinian Chang, Tansheng Zhu, Yuebo Sun, Kaifeng Lyu, Jian Li

TL;DR

This work investigates spurious correlations as a primary driver of hallucinations in large language models, showing that surface-level statistical shortcuts can yield high-confidence outputs that conflict with the ground truth and evade standard detectors. The authors construct a controllable synthetic framework that ties surnames to attributes with a correlation strength $\rho$, and they demonstrate that detection methods deteriorate as $\rho$ increases. This degradation persists across model scales and even after refusal fine-tuning. They validate the phenomenon on real-world LLMs (including GPT-5) using entity co-occurrence proxies and SimpleQA, where higher co-occurrence correlates with more confident but incorrect answers and reduced detectability. A theoretical kernel-based model then explains why confidence-based detection fails under strong correlations, highlighting a tension between memorization and generalization and underscoring the need for new approaches that address spurious correlations throughout the model development lifecycle. Overall, the paper argues that mitigating hallucinations requires strategies beyond traditional uncertainty and internal-probe methods, with a focus on identifying and mitigating shortcut-like correlations in training and fine-tuning data.
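To make the role of $\rho$ concrete, here is a minimal sketch of one way a surname-attribute correlation could be injected into synthetic data. It is an illustration under stated assumptions, not the authors' construction: the function names, the surname-to-attribute pairing, and the rule "use the linked attribute with probability $\rho$, otherwise pick a different attribute uniformly" are all hypothetical.

```python
import random


def make_synthetic_people(n, surnames, attributes, rho, seed=0):
    """Sample n (surname, attribute) pairs with a tunable spurious correlation.

    Assumption (not from the paper): each surname is linked to one attribute,
    and a sampled person receives that linked attribute with probability rho.
    Otherwise the attribute is drawn uniformly from the remaining values.
    Requires at least two distinct attribute values.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    if len(set(attributes)) < 2:
        raise ValueError("need at least two distinct attributes")
    rng = random.Random(seed)
    # Hypothetical pairing: surname i is linked to attribute i mod len(attributes).
    linked = {s: attributes[i % len(attributes)] for i, s in enumerate(surnames)}
    people = []
    for _ in range(n):
        surname = rng.choice(surnames)
        if rng.random() < rho:
            attribute = linked[surname]  # follows the spurious shortcut
        else:
            others = [a for a in attributes if a != linked[surname]]
            attribute = rng.choice(others)  # contradicts the shortcut
        people.append((surname, attribute))
    return people, linked


def match_rate(people, linked):
    """Fraction of samples whose attribute equals the surname's linked attribute."""
    return sum(attr == linked[s] for s, attr in people) / len(people)
```

Under this sampling rule the empirical match rate concentrates around $\rho$, so a model that only learns the surname shortcut would be wrong on roughly a $1 - \rho$ fraction of individuals while having little reason to be uncertain about them.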

Abstract

Despite substantial advances, large language models (LLMs) continue to exhibit hallucinations, generating plausible yet incorrect responses. In this paper, we highlight a critical yet previously underexplored class of hallucinations driven by spurious correlations -- superficial but statistically prominent associations between features (e.g., surnames) and attributes (e.g., nationality) present in the training data. We demonstrate that these spurious correlations induce hallucinations that are confidently generated, immune to model scaling, evade current detection methods, and persist even after refusal fine-tuning. Through systematically controlled synthetic experiments and empirical evaluations on state-of-the-art open-source and proprietary LLMs (including GPT-5), we show that existing hallucination detection methods, such as confidence-based filtering and inner-state probing, fundamentally fail in the presence of spurious correlations. Our theoretical analysis further elucidates why these statistical biases intrinsically undermine confidence-based detection techniques. Our findings thus emphasize the urgent need for new approaches explicitly designed to address hallucinations caused by spurious correlations.

Paper Structure

This paper contains 67 sections, 10 theorems, 48 equations, 17 figures, 3 tables.

Key Result

Theorem 1

Under some technical assumptions (see Assumptions asm:kernel-asm:matrix in Appendix appx:proofs), let $f_N$ be the kernel interpolation solution on the training set $D_N$ generated as above. Further, suppose either Then for any $\delta \in (0, 1)$, there exist constants $C_0, N_0, \alpha > 0$, for any $N \ge N_0$, define the uniform upper confidence bound as $U_{N}^{\delta} \coloneqq C_0 \delta^{

Figures (17)

  • Figure 1: Spurious correlations induce high-confidence hallucinations that evade detection and mitigation. Statistical biases in training data (e.g., name-nationality) lead to consistent errors resistant to uncertainty metrics and refusal fine-tuning.
  • Figure 2: AUROC of different hallucination detection methods versus $\rho$. Left: Experimental results of pretrained models. Right: Experimental results of models continually pretrained from SmolLM2-1.7B. The classification performance of different detection methods drops as $\rho$ increases, indicating that spurious correlation hinders hallucination detection.
  • Figure 3: Performance of fine-tuned models of various sizes under varying correlation coefficients. Left: The test accuracy for factual recall questions regarding known individuals. Right: The refusal rate when queried about unknown individuals.
  • Figure 4: Self-Consistency and Self-Confidence versus Entity Co-occurrence. Left: Mean self-confidence (1–5) of model responses across entity-overlap buckets increases as co-occurrence rises. Right: Self-consistency, defined as the frequency of the most common answer (mode) among 10 independent generations, also increases with entity co-occurrence.
  • Figure 5: Hallucination detection performance versus entity co-occurrence. Left: GPT-OSS-20B. Right: Qwen-30B-A3B-Instruct. Classification performance decreases consistently as Jaccard overlap increases, across all evaluated detection methods, including perplexity, window entropy, logit entropy, attention-score heuristics, and linear probes.
  • ...and 12 more figures
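
The quantities named in the captions above (self-consistency as the frequency of the modal answer, Jaccard overlap as a co-occurrence proxy, and AUROC as the detection metric) can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code. The answer normalization, the choice to compute Jaccard overlap over sets of document IDs, and all function names are assumptions made here.

```python
from collections import Counter


def self_consistency(answers):
    """Frequency of the most common answer among sampled generations.

    Assumption: answers are compared after trimming and lowercasing.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| of two collections.

    Assumption: each entity is represented by the set of documents
    (hypothetical IDs) in which it appears.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def auroc(scores, labels):
    """AUROC as the probability that a positive example outscores a negative one.

    labels are 1 (e.g. hallucination) or 0 (correct); ties count as one half.
    Uses the quadratic pairwise definition, which is fine for small samples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

An AUROC near 0.5 means the detector's score carries almost no information about which answers are hallucinated.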

Theorems & Definitions (18)

  • Theorem 1: Informal version of Theorem \ref{thm:ridgeless} in Appendix \ref{appx:ridgeless}
  • Definition 1
  • Theorem 2
  • Lemma 3
  • proof
  • Definition 2
  • Lemma 4: Theorem 5 in wu1993local; Theorem 5.4 in kanagawa2018gaussian
  • Lemma 5
  • proof
  • proof: Proof of Theorem \ref{thm:ridge}
  • ...and 8 more