LLM Internal States Reveal Hallucination Risk Faced With a Query

Ziwei Ji; Delong Chen; Etsuko Ishii; Samuel Cahyawijaya; Yejin Bang; Bryan Wilie; Pascale Fung

TL;DR

This work investigates whether LLM internal states encode self-awareness of hallucination risk. It proposes a probing estimator that predicts two signals directly from internal representations: whether a query was seen in the training data, and whether the model is likely to hallucinate on it. Using last-token activations from a specified layer and an LLM-informed MLP, the approach is fast at run time and reaches an average hallucination estimation accuracy of $84.32\%$ across 15 NLG tasks, with deeper layers carrying increasingly informative cues about uncertainty. The methodology combines data construction that separates seen from unseen queries and labels hallucination risk across 700+ datasets, mutual-information-based neuron analysis, and multi-metric labeling (Rouge-L, NLI, QuestEval). Evaluation shows task-dependent gains and limited transfer across models. In practice, the internal-state estimator offers a fast signal that could guide retrieval augmentation or risk-aware generation. The results also underline that the signals are model-specific, and that future work needs broader LLM coverage and human-in-the-loop validation.

Abstract

The hallucination problem of Large Language Models (LLMs) significantly limits their reliability and trustworthiness. Humans have a self-awareness process that allows us to recognize what we don't know when faced with queries. Inspired by this, our paper investigates whether LLMs can estimate their own hallucination risk before response generation. We analyze the internal mechanisms of LLMs broadly, both in terms of training data sources and across 15 diverse Natural Language Generation (NLG) tasks spanning over 700 datasets. Our empirical analysis reveals two key insights: (1) LLM internal states indicate whether the model has seen the query in its training data; and (2) LLM internal states indicate whether the model is likely to hallucinate on the query. Our study explores the particular neurons, activation layers, and tokens that play a crucial role in the LLM's perception of uncertainty and hallucination risk. Using a probing estimator, we leverage LLM self-assessment and achieve an average hallucination estimation accuracy of 84.32% at run time.
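The probing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator: the paper uses an MLP over last-token activations of a real LLM, whereas this example swaps in a plain logistic-regression probe and trains it on synthetic vectors that stand in for internal states. The names `train_probe`, `predict_risk`, and `make_synthetic_states` are hypothetical, as are the data dimensions and the assumption that the hallucination signal sits in a single coordinate.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(states, labels, lr=0.5, epochs=200):
    """Fit a linear probe (logistic regression) with full-batch gradient descent.

    states: list of feature vectors (stand-ins for last-token activations).
    labels: list of 0/1 targets (1 = the response was labeled a hallucination).
    """
    dim = len(states[0])
    n = len(states)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(states, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_risk(w, b, state):
    # Estimated probability that the model would hallucinate on this query.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, state)) + b)


def make_synthetic_states(n, dim, seed=0):
    # ASSUMPTION: synthetic data only. Coordinate 0 carries the class signal;
    # the remaining coordinates are Gaussian noise.
    rng = random.Random(seed)
    states, labels = [], []
    for k in range(n):
        y = k % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 if y == 1 else -2.0
        states.append(x)
        labels.append(y)
    return states, labels
```

In a real setting, `states` would be replaced by activations extracted from a chosen layer at the last query token, and `labels` by hallucination labels derived from metrics such as those the paper lists (Rouge-L, NLI, QuestEval). Because the probe needs only one forward pass over the query and no generation, an estimator of this kind can run before the response is produced.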

Paper Structure (43 sections, 2 equations, 8 figures, 4 tables)

Introduction
Hallucination and Training Data
Methodology
Problem Formulation
Data Construction
(1) Seen/Unseen Query in Training Data
(2) Hallucination Risk faced with the Query
Preliminary Analysis: Neurons for Hallucination Perception from Internal States
Internal State-based Estimator
Experiments
LLM
Baselines
Zero-shot Prompt
In-Context-Learning (ICL) Prompt
Perplexity (PPL)
...and 28 more sections

Figures (8)

Figure 1: Humans have self-awareness and recognize uncertainties when confronted with unknown questions. LLM internal states reveal uncertainty even before responding. Pink dots are the internal LLM states associated with hallucinated responses, whereas Blue dots are those of faithful responses. The queries leading to those LLM responses are colored accordingly.
Figure 2: Visualization of the Neurons for Hallucination Perception in various NLG tasks. Pink dots represent Unknown Queries triggering hallucinations and Blue dots represent Known Queries.
Figure 3: Automatic evaluation results for our method and baselines, including Perplexity (PPL), Zero-shot Prompt, and In-Context-Learning (ICL) Prompt.
Figure 4: F1 scores of Internal-State from Different Layers for Hallucination Estimation.
Figure 5: Inference time of various estimation methods.
...and 3 more figures
