From Out-of-Distribution Detection to Hallucination Detection: A Geometric View

Litian Liu, Reza Pourreza, Yubing Jian, Yao Qin, Roland Memisevic

TL;DR

This work revisits hallucination detection through the lens of out-of-distribution (OOD) detection, and suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.

Abstract

Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. While existing hallucination detection methods achieve strong performance in question-answering tasks, they remain less effective on tasks requiring reasoning. In this work, we revisit hallucination detection through the lens of out-of-distribution (OOD) detection, a well-studied problem in areas like computer vision. Treating next-token prediction in language models as a classification task allows us to apply OOD techniques, provided appropriate modifications are made to account for the structural differences in large language models. We show that OOD-based approaches yield training-free, single-sample-based detectors, achieving strong accuracy in hallucination detection for reasoning tasks. Overall, our work suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.

Paper Structure

This paper contains 33 sections, 2 theorems, 26 equations, 1 figure, and 7 tables.

Key Result

Theorem 3.3

Adapted from liu2024fast. Given an embedding $\bm{z}$ and a token $c \in \mathcal{V}$ with $c \neq \arg\max_{v \in \mathcal{V}} (\bm{w}_v^\top \bm{z} + b_v)$, $D_f(\bm{z}, c)$ is lower bounded by […]. See the proof in Appendix sec:appendix_proof.
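
Theorem 3.3 concerns the distance from an embedding to the decision boundary between the predicted token and another token $c$. As a rough illustration only, the sketch below assumes a linear output layer and assumes the bound takes the standard point-to-hyperplane form, $|(\bm{w}_{\hat{c}} - \bm{w}_c)^\top \bm{z} + (b_{\hat{c}} - b_c)| / \lVert \bm{w}_{\hat{c}} - \bm{w}_c \rVert_2$, where $\hat{c}$ is the predicted token. The paper's exact expression may differ. The function name `boundary_distance` and the toy weights are hypothetical.

```python
import math


def boundary_distance(z, W, b, c):
    """Point-to-hyperplane distance from embedding z to the boundary between
    the predicted token and candidate token c.

    Assumption: this is the generic distance to the hyperplane where the two
    logits are equal, not necessarily the exact bound stated in the paper.
    """
    # Logits of a linear output layer: w_v^T z + b_v for every token v.
    logits = [sum(wi * zi for wi, zi in zip(w, z)) + bv for w, bv in zip(W, b)]
    pred = max(range(len(logits)), key=logits.__getitem__)
    if c == pred:
        raise ValueError("c must differ from the predicted token")

    # Normal vector of the hyperplane separating pred from c.
    w_diff = [wp - wc for wp, wc in zip(W[pred], W[c])]
    norm = math.sqrt(sum(x * x for x in w_diff))
    if norm == 0.0:
        raise ValueError("weight vectors of the two tokens coincide")

    # Logit gap divided by the norm of the weight difference.
    return abs(logits[pred] - logits[c]) / norm


# Toy example: 3 tokens, 2-dimensional embeddings (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b = [0.0, 0.0, 0.0]
z = [3.0, 1.0]

d_token1 = boundary_distance(z, W, b, c=1)
d_token2 = boundary_distance(z, W, b, c=2)
```

In this toy setup the predicted token is 0, and the embedding lies closer to the boundary with token 1 than to the boundary with token 2. Under the geometric view in Figure 1(b), smaller distances of this kind would indicate higher uncertainty.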

Figures (1)

  • Figure 1: OOD-inspired geometric uncertainty measures can detect hallucinations. (a) Embeddings from hallucinated responses exhibit less proximity to weight vectors, extending OOD detector NCI liu2025detecting. (b) Embeddings from hallucinated responses exhibit smaller distance to decision boundaries than correct embeddings, extending OOD detector fDBD liu2024fast. (a) Left and (b) Left illustrate the proximity score and distance to the decision boundaries defined in Definition 3.1 and Definition 3.2, respectively. (a) Right and (b) Right show histograms for the corresponding uncertainty measures based on the CSQA dataset on Llama-3.2-3B-Instruct.

Theorems & Definitions (6)

  • Definition 3.1: Feature Proximity to Weight Vectors
  • Definition 3.2: Distance to Decision Boundary
  • Theorem 3.3: Approximate Distance to Decision Boundary
  • Lemma 4.1: Analytical Solution for Decision-Neutral Closest Point
  • proof
  • Definition 4.1: Decision-Neutral Closest Point