RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection
Yiming Huang, Junyan Zhang, Zihao Wang, Biquan Bie, Yunzhong Qiu, Xuming Hu, Yi R. Fung, Xinlei He
TL;DR
RePPL addresses hallucinations in large language models by recalibrating uncertainty from two sources: semantic-propagation uncertainty, captured via token attribution across sampled generations (InnerPPL), and language-generation uncertainty, captured via token confidences (OuterPPL). The two components are combined multiplicatively into a final RePPL score, yielding both strong detection performance and token-level explanations of hallucination triggers. Empirical results on four QA datasets and multiple instruction-tuned models show an average AUC of approximately 0.833 across all datasets, outperforming several baselines and providing robust, explainable uncertainty signals. The work advances explainable QA hallucination detection and offers a practical, non-parametric approach suitable for zero-shot analysis with interpretable token-level cues.
Abstract
Large Language Models (LLMs) have become powerful, but hallucinations remain a major obstacle to their trustworthy use. Previous work has improved hallucination detection by measuring uncertainty, but it cannot explain the provenance of hallucinations, in particular which parts of the input tend to trigger them. Recent work on prompt attacks indicates that uncertainty exists in semantic propagation, where attention mechanisms gradually fuse local token information into high-level semantics across layers. Meanwhile, uncertainty also emerges in language generation, owing to the probability-based selection of high-level semantics for sampled generations. Based on this, we propose RePPL, which recalibrates uncertainty measurement along these two aspects: it assigns an explainable uncertainty score to each token and aggregates the scores in a perplexity-style log-average form into a total score. Experiments show that it achieves the best overall detection performance across various QA datasets on advanced models (average AUC of 0.833) and that it produces token-level uncertainty scores that serve as explanations of hallucination.
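To make the "perplexity-style log-average" aggregation concrete, the sketch below shows ordinary perplexity and the equivalent log-average (geometric mean) of per-token scores. It is only an illustration under stated assumptions: the function names `perplexity`, `log_average`, and `combine` are hypothetical, the per-token scores are taken here to be inverse token probabilities, and the product in `combine` merely mirrors the multiplicative combination mentioned in the TL;DR. The actual definitions of InnerPPL and OuterPPL are not given in this section and are not reproduced here.

```python
import math


def perplexity(token_probs):
    """Standard perplexity: exp of the mean negative log-probability."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


def log_average(token_scores):
    """Log-average of strictly positive per-token scores.

    This is exp(mean(log(score))), i.e. the geometric mean. With
    score = 1 / p it coincides with standard perplexity.
    """
    if not token_scores:
        raise ValueError("token_scores must be non-empty")
    if any(s <= 0.0 for s in token_scores):
        raise ValueError("scores must be strictly positive")
    return math.exp(sum(math.log(s) for s in token_scores) / len(token_scores))


def combine(inner_score, outer_score):
    """Hypothetical multiplicative combination of two sequence-level scores.

    Assumption for illustration only: the inputs stand in for the two
    components named in the text; how they are computed is not shown here.
    """
    return inner_score * outer_score


# Toy token probabilities (made-up numbers, not results from the paper).
probs = [0.5, 0.25, 0.5]
scores = [1.0 / p for p in probs]  # assumed per-token uncertainty scores

ppl = perplexity(probs)
agg = log_average(scores)
```

In this toy case both `ppl` and `agg` equal the cube root of 16, which shows why a log-average over per-token scores is described as perplexity-style: each token contributes one additive term in log space, so the contribution of any single token to the total can be read off directly.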
