Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, Yarin Gal

TL;DR

This work introduces Semantic Entropy Probes (SEPs), cheap linear probes that predict semantic entropy from a single generation's hidden states to detect LLM hallucinations. By supervising probes on semantic entropy rather than accuracy, SEPs achieve strong generalization to out-of-distribution tasks and require no ground-truth correctness labels at test time. Across multiple models and datasets, SEPs capture semantic uncertainty within hidden representations, perform competitively with expensive sampling-based baselines, and show improved robustness to unseen tasks. The results suggest that internal model states encode semantic uncertainty, enabling cost-effective uncertainty quantification and reliable hallucination detection in practical deployments.
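
The probing idea can be illustrated with a minimal sketch. It assumes that semantic entropy has already been binarized into low (0) and high (1) labels and that hidden states are available as plain vectors. The toy data, the helper names `train_probe`, `probe_score`, and `make_toy_data`, and all hyperparameters are hypothetical and not taken from the paper; a real setup would use hidden states extracted from an LLM.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(hidden_states, labels, lr=0.5, epochs=200):
    """Fit a linear (logistic regression) probe with batch gradient descent."""
    dim = len(hidden_states[0])
    n = len(hidden_states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(hidden_states, labels):
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - y
            for j in range(dim):
                grad_w[j] += err * h[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(w, b, h):
    """Predicted probability that the generation has high semantic entropy."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


def make_toy_data(n, direction, seed):
    """Synthetic stand-in for hidden states (assumption, not real LLM data):
    high-entropy examples are shifted along a fixed direction."""
    rng = random.Random(seed)
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        states.append([rng.gauss(0, 1) + (2 * y - 1) * d for d in direction])
        labels.append(y)
    return states, labels


# Hypothetical "uncertainty direction" in an 8-dimensional hidden space.
direction = [1.0, -1.0, 0.5, -0.5, 1.0, -1.0, 0.5, -0.5]
train_h, train_y = make_toy_data(200, direction, seed=0)
test_h, test_y = make_toy_data(100, direction, seed=1)

w, b = train_probe(train_h, train_y)
accuracy = sum(
    int(probe_score(w, b, h) > 0.5) == y for h, y in zip(test_h, test_y)
) / len(test_y)
print(f"held-out accuracy of the toy probe: {accuracy:.2f}")
```

At test time only `probe_score` is evaluated, on the hidden state of a single generation, which is why the added cost is close to zero compared with sampling several generations.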

Abstract

We propose semantic entropy probes (SEPs), a cheap and reliable method for uncertainty quantification in Large Language Models (LLMs). Hallucinations, which are plausible-sounding but factually incorrect and arbitrary model generations, present a major challenge to the practical adoption of LLMs. Recent work by Farquhar et al. (2024) proposes semantic entropy (SE), which can detect hallucinations by estimating uncertainty in the space of semantic meaning for a set of model generations. However, the 5-to-10-fold increase in computation cost associated with SE computation hinders practical adoption. To address this, we propose SEPs, which directly approximate SE from the hidden states of a single generation. SEPs are simple to train and do not require sampling multiple model generations at test time, reducing the overhead of semantic uncertainty quantification to almost zero. We show that SEPs retain high performance for hallucination detection and generalize better to out-of-distribution data than previous probing methods that directly predict model accuracy. Our results across models and tasks suggest that model hidden states capture SE, and our ablation studies give further insights into the token positions and model layers for which this is the case.
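
To make the notion of uncertainty over meanings concrete, here is a small sketch of a discrete entropy estimate over semantic clusters. It assumes each sampled generation has already been assigned a cluster id for its meaning; how answers are grouped by meaning is outside this sketch, and the function name `semantic_entropy` and the example answers are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy of the empirical distribution over meaning clusters.

    cluster_ids holds one label per sampled generation; generations that
    express the same meaning share a label (assumed to be given).
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical cluster labels for 10 sampled answers to one question.
consistent = ["paris"] * 10
scattered = ["paris", "lyon", "nice", "paris", "rome",
             "lyon", "bern", "nice", "oslo", "rome"]

print("consistent answers:", semantic_entropy(consistent))
print("scattered answers: ", semantic_entropy(scattered))
```

When all samples share one meaning the estimate is zero, while answers spread over many meanings give a higher value. Each estimate needs several sampled generations, which is where the multi-fold cost mentioned above comes from.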

Paper Structure

This paper contains 20 sections, 6 equations, 21 figures, and 4 tables.

Figures (21)

  • Figure 1: Semantic entropy probes (SEPs) outperform accuracy probes for hallucination detection with Llama-2-7B, although there is a gap to 10x costlier baselines. (See the experiments section.)
  • Figure 2: Semantic Entropy Probes (SEPs) achieve high fidelity for predicting semantic entropy. Across datasets and models, SEPs are consistently able to capture semantic entropy from hidden states of mid-to-late layers. Short generation scenario with probes trained on second-last token (SLT).
  • Figure 3: Semantic entropy can be predicted from the hidden states of the last input token, without generating any novel tokens. Short generations with SEPs trained on the token before generating (TBG).
  • Figure 4: SEPs successfully capture semantic entropy in Llama-2-70B and Llama-3-70B for long generations across layers and for both SLT and TBG token positions.
  • Figure 5: SEPs capture the drop in SE due to added context.
  • ...and 16 more figures