First Hallucination Tokens Are Different from Conditional Ones
Jakob Snel, Seong Joon Oh
TL;DR
The paper addresses token-level hallucination detection by investigating how detection signals vary across tokens within a hallucinated span. It augments the RAGTruth dataset with per-token logits and performs a position-aware analysis using in-span index $k$ and span index $j$, evaluating detectability with AUROC and separability with Min-K metrics. The main finding is that the first hallucination token ($k=0$) is consistently more detectable and separable than later, conditional tokens, with entropy being the strongest signal, though no single logit-derived feature remains robust across all positions. The work highlights the need for richer internal signals and position-aware approaches to enable robust, interpretable, token-level hallucination detection suitable for real-time filtering and targeted correction.
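The position-aware analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the use of binary per-token labels, and the choice of all non-hallucinated tokens as the negative class for every $k$ are assumptions made for the example. It computes softmax entropy from a token's logits, assigns each hallucinated token its in-span index $k$, and reports one AUROC per $k$.

```python
import math
from typing import Dict, List, Optional, Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    # Skip terms that underflowed to zero; they contribute nothing to the sum.
    return -sum((e / z) * math.log(e / z) for e in exps if e > 0.0)


def in_span_index(labels: Sequence[int]) -> List[Optional[int]]:
    """Map binary hallucination labels to the in-span index k (None outside spans).

    Assumption: a span is a maximal run of 1-labels, so two directly
    adjacent annotated spans would be merged into one here.
    """
    out: List[Optional[int]] = []
    k = 0
    for lab in labels:
        if lab:
            out.append(k)
            k += 1
        else:
            out.append(None)
            k = 0
    return out


def auroc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_by_position(scores: Sequence[float], labels: Sequence[int]) -> Dict[int, float]:
    """AUROC of `scores` for hallucinated tokens at each k versus all clean tokens."""
    ks = in_span_index(labels)
    neg = [s for s, k in zip(scores, ks) if k is None]
    by_k: Dict[int, List[float]] = {}
    for s, k in zip(scores, ks):
        if k is not None:
            by_k.setdefault(k, []).append(s)
    return {k: auroc(pos, neg) for k, pos in sorted(by_k.items())}


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    toy_scores = [0.3, 0.9, 0.2, 0.1, 0.8]
    toy_labels = [0, 1, 1, 0, 1]
    print(auroc_by_position(toy_scores, toy_labels))
```

In practice the scores would be `token_entropy` applied to each generated token's logits, and the pairwise AUROC loop would be replaced by a rank-based implementation for corpus-scale data. Grouping by the span index $j$ would need span boundaries from the annotations rather than binary labels alone.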
Abstract
Large Language Models (LLMs) hallucinate, and detecting these cases is key to ensuring trust. While many approaches address hallucination detection at the response or span level, recent work explores token-level detection, enabling more fine-grained intervention. However, the distribution of hallucination signal across sequences of hallucinated tokens remains unexplored. We leverage token-level annotations from the RAGTruth corpus and find that the first hallucinated token is far more detectable than later ones. This structural property holds across models, suggesting that first hallucination tokens play a key role in token-level hallucination detection. Our code is available at https://github.com/jakobsnl/RAGTruth_Xtended.
