LLM Internal States Reveal Hallucination Risk Faced With a Query
Ziwei Ji, Delong Chen, Etsuko Ishii, Samuel Cahyawijaya, Yejin Bang, Bryan Wilie, Pascale Fung
TL;DR
This work investigates whether LLM internal states encode self-awareness about hallucination risk. It proposes a probing estimator that predicts two signals directly from internal representations: whether a query has been seen in training data, and whether the model is likely to hallucinate on it. Using last-token activations from a specified layer and an LLM-informed MLP, the estimator runs at inference time, before a response is generated, and reaches an average hallucination estimation accuracy of 84.32% across 15 NLG tasks. The analysis also indicates that deeper layers carry increasingly informative cues about uncertainty. The methodology combines data construction that separates seen from unseen queries and labels hallucination risk across 700+ datasets, mutual-information-based neuron analysis, and multi-metric labeling (Rouge-L, NLI, QuestEval). Evaluation shows task-dependent gains and limited cross-model transfer. Practically, this internal-state estimator offers a fast signal that could guide retrieval augmentation or risk-aware generation. The results also highlight the importance of model-specific signals and the need for broader LLM coverage and human-in-the-loop validation in future work.
Abstract
The hallucination problem of Large Language Models (LLMs) significantly limits their reliability and trustworthiness. Humans have a self-awareness process that allows us to recognize what we don't know when faced with queries. Inspired by this, our paper investigates whether LLMs can estimate their own hallucination risk before response generation. We analyze the internal mechanisms of LLMs broadly, both in terms of training data sources and across 15 diverse Natural Language Generation (NLG) tasks spanning over 700 datasets. Our empirical analysis reveals two key insights: (1) LLM internal states indicate whether the model has seen the query in its training data; and (2) LLM internal states indicate whether the model is likely to hallucinate on the query. Our study explores particular neurons, activation layers, and tokens that play a crucial role in the LLM perception of uncertainty and hallucination risk. Using a probing estimator, we leverage LLM self-assessment, achieving an average hallucination estimation accuracy of 84.32% at run time.
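
To make the idea of a probing estimator concrete, the following is a minimal sketch and not the authors' implementation. It assumes the last-token activation of a query is already available as a fixed-length vector; here synthetic Gaussian vectors stand in for real LLM hidden states. A simple logistic-regression probe replaces the MLP used in the paper, and every name (`make_synthetic_states`, `train_probe`, `probe_score`, `accuracy`) is hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_states(n, dim=8, shift=3.0, seed=0):
    """Synthetic stand-in for last-token activations (an assumption, not real data).

    Label 1 ("likely to hallucinate") shifts the first two coordinates;
    all coordinates otherwise carry unit Gaussian noise.
    """
    rng = random.Random(seed)
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if y == 1:
            x[0] += shift
            x[1] += shift
        states.append(x)
        labels.append(y)
    return states, labels


def probe_score(w, b, x):
    """Estimated hallucination risk in [0, 1] for one internal-state vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(states, labels, lr=0.2, epochs=300):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = probe_score(w, b, x) - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, states, labels):
    """Fraction of examples where the thresholded probe matches the label."""
    hits = sum((probe_score(w, b, x) >= 0.5) == (y == 1)
               for x, y in zip(states, labels))
    return hits / len(states)
```

In a real setting, the synthetic vectors would be replaced by hidden states extracted from a chosen layer of the LLM, and the labels by hallucination judgments on the model's own responses. The resulting score could then be thresholded before generation, for example to decide whether to trigger retrieval augmentation.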
