Hallucination Detection in LLMs Using Spectral Features of Attention Maps

Jakub Binkowski, Denis Janiak, Albert Sawczyn, Bogdan Gabrys, Tomasz Kajdanowicz

TL;DR

The paper tackles hallucinations in LLMs by introducing LapEigvals, a supervised detector built on the top-$k$ eigenvalues of the graph Laplacian derived from attention maps. Treating each attention map as the adjacency matrix $A^{(l,h)}$ of a graph, it computes $L^{(l,h)} = D^{(l,h)} - A^{(l,h)}$ for every layer $l$ and head $h$. Because causal attention maps are triangular, so is each Laplacian, and its eigenvalues are simply its diagonal entries. The top-$k$ eigenvalues from all layers and heads form the feature vector, which is reduced via PCA and fed to a logistic regression probe. Across 7 QA datasets and 5 LLMs, LapEigvals achieves state-of-the-art performance among attention-based methods, and ablations show it is robust to hyperparameters, prompts, and temperatures. The results indicate that spectral properties of internal attention dynamics carry meaningful signals for safety-critical hallucination detection, with practical implications for the reliability of real-world AI systems. The paper also discusses limitations and avenues for future research, including generalization to unseen architectures and potential self-supervised approaches to enhance robustness.

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance across various tasks but remain prone to hallucinations. Detecting hallucinations is essential for safety-critical applications, and recent methods leverage attention map properties to this end, though their effectiveness remains limited. In this work, we investigate the spectral features of attention maps by interpreting them as adjacency matrices of graph structures. We propose the $\text{LapEigvals}$ method, which utilises the top-$k$ eigenvalues of the Laplacian matrix derived from the attention maps as an input to hallucination detection probes. Empirical evaluations demonstrate that our approach achieves state-of-the-art hallucination detection performance among attention-based methods. Extensive ablation studies further highlight the robustness and generalisation of $\text{LapEigvals}$, paving the way for future advancements in the hallucination detection domain.
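
The feature extraction described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation. The summary only states $L = D - A$, so the degree definition used here (column sums of the attention map, normalised by the number of tokens that can attend to each position) is an assumption, as are the function names and the random attention maps used for the demo. Sequences are assumed to have at least $k$ tokens.

```python
import numpy as np


def laplacian_topk_eigvals(attn: np.ndarray, k: int) -> np.ndarray:
    """Top-k Laplacian eigenvalues for one causal attention map of shape (T, T)."""
    seq_len = attn.shape[0]
    # ASSUMPTION (not stated in this summary): the degree of token i is the
    # attention it receives (column sum), divided by the number of tokens
    # that can attend to it, which is T - i for 0-indexed i.
    degree = attn.sum(axis=0) / (seq_len - np.arange(seq_len))
    # L = D - A is lower triangular for causal attention, so its eigenvalues
    # are its diagonal entries d_ii - a_ii; no eigendecomposition is needed.
    eigvals = degree - np.diag(attn)
    # Keep the k largest eigenvalues, in descending order.
    return np.sort(eigvals)[::-1][:k]


def extract_features(attn_maps: np.ndarray, k: int) -> np.ndarray:
    """Concatenate top-k eigenvalues over all layers and heads.

    attn_maps has shape (layers, heads, T, T).
    """
    layers, heads = attn_maps.shape[:2]
    per_head = [
        laplacian_topk_eigvals(attn_maps[l, h], k)
        for l in range(layers)
        for h in range(heads)
    ]
    return np.concatenate(per_head)


def random_causal_attention(rng, layers: int, heads: int, seq_len: int) -> np.ndarray:
    """Hypothetical stand-in for real attention maps: row-softmax of masked noise."""
    logits = rng.normal(size=(layers, heads, seq_len, seq_len))
    causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    logits = np.where(causal_mask, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
attn_maps = random_causal_attention(rng, layers=2, heads=3, seq_len=8)
features = extract_features(attn_maps, k=4)
print(features.shape)  # (24,)
```

In the pipeline the paper describes, one such vector per generated answer would then be reduced with PCA and passed to the logistic regression probe.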

Paper Structure

This paper contains 34 sections, 2 theorems, 6 equations, 12 figures, 14 tables.

Key Result

Lemma 1

The Laplacian eigenvalues are bounded: $-1 \leq \lambda_i \leq 1$.
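
A short sketch of why a bound of this form is plausible, under assumptions that this summary does not spell out: the attention map $A$ is lower triangular with entries in $[0, 1]$, and the degree $d_{ii}$ is the mean attention that token $i$ receives from the $T - i$ tokens able to attend to it (0-indexed). This is an illustration, not the paper's proof.

```latex
% Assumed degree definition (0-indexed, sequence length T):
d_{ii} = \frac{1}{T - i} \sum_{u = i}^{T-1} a_{ui},
\qquad 0 \le a_{ui} \le 1 .

% L = D - A is lower triangular, so its eigenvalues are its diagonal entries:
\lambda_i = d_{ii} - a_{ii} .

% d_{ii} is a mean of T - i values in [0, 1], and a_{ii} \in [0, 1], hence:
0 \le d_{ii} \le 1
\quad\text{and}\quad
0 \le a_{ii} \le 1
\;\Longrightarrow\;
-1 \le \lambda_i \le 1 .
```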

Figures (12)

  • Figure 1: Visualization of $p$-values from the two-sided Mann-Whitney U test for all layers and heads of Llama-3.1-8B across two feature types: $\operatorname{AttentionScore}$ and the $k{=}10$ Laplacian eigenvalues. These features were derived from attention maps collected when the LLM answered questions from the TriviaQA dataset. Higher $p$-values indicate no significant difference in feature values between hallucinated and non-hallucinated examples. For $\operatorname{AttentionScore}$, $80\%$ of heads have $p<0.05$, while for Laplacian eigenvalues, this percentage is $91\%$. Therefore, Laplacian eigenvalues may be better predictors of hallucinations, as feature values across more heads exhibit statistically significant differences between hallucinated and non-hallucinated examples.
  • Figure 2: The autoregressive inference process in an LLM is depicted as a graph for a single attention head $h$ (as introduced by Vaswani et al., 2017) and three generated tokens ($\hat{x}_1, \hat{x}_2, \hat{x}_3$). Here, $\mathbf{h}^{(l)}_{i}$ represents the hidden state at layer $l$ for the input token $i$, while $a^{(l, h)}_{i, j}$ denotes the scalar attention score between tokens $i$ and $j$ at layer $l$ and attention head $h$. Arrow direction indicates the flow of information during inference.
  • Figure 3: Overview of the methodology used in this work. Solid lines indicate the test-time pipeline, while dashed lines represent additional pipeline steps for generating labels for training the hallucination probe (logistic regression). The primary contribution of this work is leveraging the top-$k$ eigenvalues of the Laplacian as features for the hallucination probe, highlighted with a bold box on the diagram.
  • Figure 4: Probe performance for different numbers of top-$k$ eigenvalues, $k \in \{5, 10, 25, 50, 100\}$, on the TriviaQA dataset with $temp{=}1.0$ and the $\texttt{Mistral-Small-24B}$ LLM.
  • Figure 5: Analysis of model performance across different layers for $\texttt{Mistral-Small-24B}$ on the TriviaQA dataset with $temp{=}1.0$ and $k{=}100$ top eigenvalues (results for models operating on all layers are provided for reference).
  • ...and 7 more figures
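
The per-head significance analysis summarised in the Figure 1 caption can be approximated with SciPy's `mannwhitneyu`. The sketch below is illustrative only: the data are synthetic, the function name `fraction_significant` is hypothetical, and the feature layout (one scalar per head per example) is an assumption made to keep the example small.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def fraction_significant(features: np.ndarray, labels: np.ndarray, alpha: float = 0.05):
    """Two-sided Mann-Whitney U test per head.

    features: (n_examples, n_heads), one scalar feature per head (assumed layout).
    labels:   (n_examples,), 1 = hallucinated, 0 = non-hallucinated.
    Returns the per-head p-values and the fraction of heads with p < alpha.
    """
    hallucinated = features[labels == 1]
    faithful = features[labels == 0]
    p_values = np.array([
        mannwhitneyu(hallucinated[:, j], faithful[:, j], alternative="two-sided").pvalue
        for j in range(features.shape[1])
    ])
    return p_values, float(np.mean(p_values < alpha))


# Synthetic demo data (not from the paper): 4 heads, the first two of which
# are shifted for hallucinated examples and therefore carry signal.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
features = rng.normal(size=(400, 4))
features[labels == 1, :2] += 2.0

p_values, frac = fraction_significant(features, labels)
print(p_values.round(4), frac)
```

A higher fraction of heads with $p < 0.05$ suggests that the feature separates hallucinated from non-hallucinated examples in more places, which is the comparison the caption draws between $\operatorname{AttentionScore}$ and the Laplacian eigenvalues.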

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Lemma 2
  • proof