First Hallucination Tokens Are Different from Conditional Ones

Jakob Snel, Seong Joon Oh

TL;DR

The paper addresses token-level hallucination detection by investigating how detection signals vary across tokens within a hallucinated span. It augments the RAGTruth dataset with per-token logits and performs a position-aware analysis using in-span index $k$ and span index $j$, evaluating detectability with AUROC and separability with Min-K metrics. The main finding is that the first hallucination token ($k=0$) is consistently more detectable and separable than later, conditional tokens, with entropy being the strongest signal, though no single logit-derived feature remains robust across all positions. The work highlights the need for richer internal signals and position-aware approaches to enable robust, interpretable, token-level hallucination detection suitable for real-time filtering and targeted correction.
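The two ingredients named above, a logit-derived entropy signal and the position indices $j$ and $k$, can be illustrated with a minimal sketch. It is not the authors' code: the function names `token_entropy` and `span_indices`, the use of natural-log entropy, and the convention of marking non-hallucinated tokens with -1 are all assumptions made for illustration. It also assumes spans are given as binary per-token labels, so two directly adjacent spans would be merged into one.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits: array of shape (num_tokens, vocab_size).
    """
    # Subtract the row maximum for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def span_indices(labels):
    """Assign a span index j and an in-span index k to every token.

    labels: sequence of 0/1 flags, 1 meaning the token is annotated as
    hallucinated. Non-hallucinated tokens get j = k = -1 (hypothetical
    convention chosen for this sketch).
    """
    j_idx = np.full(len(labels), -1)
    k_idx = np.full(len(labels), -1)
    j, k, prev = -1, 0, 0
    for t, lab in enumerate(labels):
        if lab:
            if not prev:
                # A new span starts here: this is its first token (k = 0).
                j += 1
                k = 0
            else:
                # Conditional token: it continues the current span.
                k += 1
            j_idx[t], k_idx[t] = j, k
        prev = lab
    return j_idx, k_idx
```

With these two arrays, the entropy of first hallucination tokens (`k_idx == 0`) can be compared against that of conditional tokens (`k_idx > 0`) and non-hallucinated tokens (`k_idx == -1`).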

Abstract

Large Language Models (LLMs) hallucinate, and detecting these cases is key to ensuring trust. While many approaches address hallucination detection at the response or span level, recent work explores token-level detection, enabling more fine-grained intervention. However, the distribution of hallucination signal across sequences of hallucinated tokens remains unexplored. We leverage token-level annotations from the RAGTruth corpus and find that the first hallucinated token is far more detectable than later ones. This structural property holds across models, suggesting that first hallucination tokens play a key role in token-level hallucination detection. Our code is available at https://github.com/jakobsnl/RAGTruth_Xtended.

Paper Structure

This paper contains 29 sections, 6 equations, 39 figures, 1 table.

Figures (39)

  • Figure 1: First Hallucination Tokens Are Different: We visualise three tokenised model responses from RAGTruth, overlaid with normalised logit entropy magnitudes. Tokens annotated as hallucinations are highlighted with red outlines. The first hallucinated token in a span exhibits higher entropy than the conditional hallucinated tokens that follow it. This pattern holds consistently across different models, hallucination positions, and contexts. [model: llama-2-13b-chat, id: 214, 64, 730]
  • Figure 2: First Hallucination Tokens Are Better Detectable: We show AUROC scores per signal and in-span hallucination token index across all hallucination spans. We report both global and averaged response-level scores. For the latter, we add error bars to account for the score distribution across different responses. For each analysis level and model, we invert a signal's AUROC scores when their average over all indices is below 0.5 on $\mathcal{T}^{\text{all}}$. [llama-2-13b-chat; all]
  • Figure 3: First Hallucination Tokens Exhibit Greater Separability: Min-10 probability distribution across different token categories and indices. Grey magnitudes are normalised across the entire category, while the numerical scores are not. Separability patterns are consistent across all token-ranking percentiles from 10 to 100 (see the appendix). As the contrast is greatest at the 10th-percentile threshold, we choose it for visualisation. [llama-2-13b-chat; all]
  • Figure 4: [all] AUROC per signal and in-span hallucination token indices from all hallucination spans at both global and response level.
  • Figure 5: [first] AUROC per signal and in-span hallucination token indices from first hallucination spans at both global and response level.
  • ...and 34 more figures
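
The per-index AUROC evaluation described in the Figure 2 caption can be sketched as follows. This is one possible reading of the caption, not the authors' implementation: the helper names `auroc` and `auroc_per_index` are hypothetical, non-hallucinated tokens (marked with in-span index -1) are assumed to be the negative class, and the inversion rule is applied to the mean over the indices that were actually evaluated.

```python
import numpy as np

def auroc(pos, neg):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def auroc_per_index(scores, k_idx, max_k):
    """AUROC of a signal for each in-span index k = 0..max_k.

    scores: per-token signal values (e.g. entropy).
    k_idx:  per-token in-span index, -1 for non-hallucinated tokens
            (assumed convention).
    """
    scores = np.asarray(scores, dtype=float)
    k_idx = np.asarray(k_idx)
    neg = scores[k_idx < 0]
    result = {}
    for k in range(max_k + 1):
        pos = scores[k_idx == k]
        if len(pos) and len(neg):
            result[k] = auroc(pos, neg)
    # If the signal is anti-correlated on average, flip its direction so that
    # 0.5 remains the chance level and higher always means more detectable.
    if result and np.mean(list(result.values())) < 0.5:
        result = {k: 1.0 - v for k, v in result.items()}
    return result
```

Under this reading, a signal for which first tokens are more detectable would show a higher value at `k = 0` than at later indices.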