The Map of Misbelief: Tracing Intrinsic and Extrinsic Hallucinations Through Attention Patterns

Elyes Hajji, Aymen Bouguerra, Fabio Arnez

TL;DR

The paper tackles hallucination detection in large language models by distinguishing intrinsic from extrinsic hallucinations and proposing a principled evaluation framework. It builds on RAUQ, a recent attention-based uncertainty quantification method, and proposes multiple token- and head-aggregation variants that estimate confidence without costly sampling. A structured benchmark and experiments across six open-source LLMs show that attention-based methods excel at intrinsic hallucination detection, while sampling-based methods such as Semantic Entropy remain strong for extrinsic cases, which argues for a type-aware approach. The results support deploying lightweight, attention-driven uncertainty signals in safety-critical settings and point to future research on grounding and detection strategies.

Abstract

Large Language Models (LLMs) are increasingly deployed in safety-critical domains, yet remain susceptible to hallucinations. While prior works have proposed confidence representation methods for hallucination detection, most of these approaches rely on computationally expensive sampling strategies and often disregard the distinction between hallucination types. In this work, we introduce a principled evaluation framework that differentiates between extrinsic and intrinsic hallucination categories and evaluates detection performance across a suite of curated benchmarks. In addition, we leverage a recent attention-based uncertainty quantification algorithm and propose novel attention aggregation strategies that improve both interpretability and hallucination detection performance. Our experimental findings reveal that sampling-based methods like Semantic Entropy are effective for detecting extrinsic hallucinations but generally fail on intrinsic ones. In contrast, our method, which aggregates attention over input tokens, is better suited for intrinsic hallucinations. These insights provide new directions for aligning detection strategies with the nature of hallucination and highlight attention as a rich signal for quantifying model uncertainty.

Paper Structure

This paper contains 19 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The Hallucination Map. AUROC scores for uncertainty estimation methods on extrinsic vs. intrinsic hallucination detection across all datasets. The proposed aggregation strategies in the RAUQ attention-based method outperform baselines and exhibit balanced performance across both hallucination types.
  • Figure 2: Comparison of AUROC$\uparrow$, AURAC$\uparrow$, and PRR$\uparrow$ scores for all hallucination detection methods, computed over the concatenation of all datasets. Error bars indicate $\pm$ one standard deviation across models. The axis for each metric is colored to match its bars, indicating the corresponding scale. Dashed lines indicate the performance of the RAUQ baseline. This figure provides a comprehensive overview of each method’s overall detection performance. Some RAUQ variants, particularly the Rollout and Mean‑Heads aggregations, consistently achieve the highest performance overall.
  • Figure 3: Comparison of hallucination detection methods across extrinsic and intrinsic hallucination benchmarks using AUROC$\uparrow$, AURAC$\uparrow$, and PRR$\uparrow$. Error bars indicate the 95% confidence intervals. The axis for each metric is colored to match its bars, indicating the corresponding scale. Dashed lines indicate the performance of the RAUQ baseline. While Semantic Entropy performs well on extrinsic hallucinations, attention-based variants aggregating over input or all tokens consistently achieve higher performance on intrinsic hallucinations. The butterfly split highlights divergent performance trends depending on hallucination type.
  • Figure 4: Normalized histograms showing the distribution of RAUQ Mean-Heads scores for Mistral-7B across SQuAD v2 Answerable (blue), Unanswerable (orange), and NonExistentRefusal-MixedEntities (grey). The red dashed line indicates the optimal threshold, obtained by maximizing the G-mean (the geometric mean of the true positive rate, TPR, and the true negative rate, TNR) across all datasets.
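
The threshold selection mentioned in the Figure 4 caption can be illustrated with a minimal sketch. It assumes that higher scores indicate a more likely hallucination and that hallucinated samples are labeled 1; the summary does not state the paper's actual convention or procedure. The function name `gmean_threshold` and the toy data are hypothetical and serve only as an illustration.

```python
import math


def gmean_threshold(scores, labels):
    """Pick the threshold that maximizes G-mean = sqrt(TPR * TNR).

    Assumptions (not taken from the paper):
      - scores: uncertainty scores, higher means more likely hallucination
      - labels: 1 for hallucination (positive), 0 for correct (negative)
      - a sample is predicted positive when score >= threshold
    Returns (threshold, gmean). Ties keep the lowest threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    best_t, best_g = None, -1.0
    # Every distinct score is a candidate threshold.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        g = math.sqrt((tp / pos) * (tn / neg))
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g


# Toy, perfectly separable data (illustrative only, not results from the paper).
toy_scores = [0.1, 0.2, 0.8, 0.9]
toy_labels = [0, 0, 1, 1]
threshold, gmean = gmean_threshold(toy_scores, toy_labels)
print(threshold, gmean)
```

G-mean weights sensitivity and specificity equally, so a threshold chosen this way is not dominated by the majority class when answerable and unanswerable samples are imbalanced.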