INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection

Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, Jieping Ye

TL;DR

This work targets knowledge hallucination in LLMs by leveraging internal states rather than decoding outputs alone. It introduces INSIDE, a framework centered on EigenScore, a covariance-based semantic-divergence metric computed from sentence embeddings across multiple generations, and a test-time feature clipping strategy to curb overconfident outputs. The approach achieves state-of-the-art hallucination detection on multiple QA benchmarks across several open LLMs, and ablations demonstrate robustness to generation count and embedding choice while highlighting efficiency advantages. These findings suggest internal-state signals are a valuable resource for reliable LLM evaluation and could inform mitigation strategies.

Abstract

Knowledge hallucination has raised widespread concerns about the security and reliability of deployed LLMs. Previous efforts to detect hallucinations have relied on logit-level uncertainty estimation or language-level self-consistency evaluation, where semantic information is inevitably lost during the token-decoding procedure. Thus, we propose to explore the dense semantic information retained within LLMs' INternal States for hallucInation DEtection (INSIDE). In particular, a simple yet effective EigenScore metric is proposed to better evaluate responses' self-consistency; it exploits the eigenvalues of the responses' covariance matrix to measure semantic consistency/diversity in the dense embedding space. Furthermore, from the perspective of self-consistent hallucination detection, a test-time feature clipping approach is explored to truncate extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. Extensive experiments and ablation studies are performed on several popular LLMs and question-answering (QA) benchmarks, showing the effectiveness of our proposal.
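
To make the EigenScore idea more concrete, the following is a minimal NumPy sketch of an eigenvalue-based divergence score over the sentence embeddings of K sampled responses. The abstract only states that the metric exploits the eigenvalues of the responses' covariance matrix. The per-response centering, the 1/d normalization, the regularizer `alpha`, and the mean-log-eigenvalue aggregation used here are illustrative assumptions, and the function name `eigenscore` is hypothetical.

```python
import numpy as np


def eigenscore(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """Illustrative divergence score for K responses to one question.

    embeddings: array of shape (K, d), one sentence embedding per response.
    alpha: small ridge term (an assumption) that keeps every eigenvalue
           strictly positive so the logarithm is defined.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    k, d = z.shape

    # Center each response's embedding over its d feature dimensions.
    z_centered = z - z.mean(axis=1, keepdims=True)

    # K x K covariance between responses (normalization is an assumption).
    cov = z_centered @ z_centered.T / d

    # Eigenvalues of the regularized symmetric matrix.
    eigvals = np.linalg.eigvalsh(cov + alpha * np.eye(k))

    # Mean log-eigenvalue, i.e. (1/K) * logdet(cov + alpha * I).
    return float(np.mean(np.log(eigvals)))


# Toy comparison: near-identical responses vs. semantically scattered ones.
rng = np.random.default_rng(0)
base = rng.normal(size=64)
consistent = np.tile(base, (5, 1))      # five copies of one embedding
diverse = rng.normal(size=(5, 64))      # five unrelated embeddings
print("consistent:", eigenscore(consistent))
print("diverse:   ", eigenscore(diverse))
```

Under these assumptions, a set of responses whose embeddings nearly coincide yields a low score, since the covariance is close to rank one, while scattered embeddings yield a higher score. A higher score would then be read as a sign of possible hallucination.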

Paper Structure (22 sections, 8 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Illustration of our proposed hallucination detection pipeline. At inference time, for a given question, the extreme features in the penultimate layer are truncated and the EigenScore is computed from the sentence embeddings of multiple responses.
  • Figure 2: Illustration of activation distributions in the penultimate layer of LLaMA-7B. (a) Activation distribution in the penultimate layer for a randomly sampled token. (b) Activation distribution of a randomly sampled neuron across numerous tokens.
  • Figure 3: (a) Performance of LLaMA-7B on the NQ dataset with different numbers of generations. (b) Performance of LLaMA-7B on the CoQA dataset with sentence embeddings taken from different layers. The orange line indicates using the last token's embedding in the middle layer (layer 17) as the sentence embedding. The gray line indicates using the averaged token embedding in the last layer as the sentence embedding. Performance is measured by $\text{AUROC}_s$.
  • Figure 4: (a) Performance sensitivity to temperature. (b) Performance sensitivity to top-k. The performance is measured by $\text{AUROC}_s$.
  • Figure 5: Inference cost comparison of different methods on LLaMA-7B and LLaMA-13B. BaseLLM denotes the LLM without any hallucination detection metric. LexicalSim denotes Lexical Similarity and SelfCKGPT denotes SelfCheckGPT.
  • ...and 1 more figures
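
The truncation of extreme penultimate-layer features mentioned in the Figure 1 caption can be sketched as a simple clamp. The text here does not say how the clipping thresholds are chosen, so the sketch below assumes they come from low and high percentiles of a reference sample of activations. The function name `clip_features`, the `reference` argument, and the percentile `p` are all hypothetical.

```python
import numpy as np


def clip_features(hidden: np.ndarray, reference: np.ndarray, p: float = 0.2) -> np.ndarray:
    """Clamp activations to a range estimated from reference activations.

    hidden:    activations to clip (any shape), e.g. one token's
               penultimate-layer features.
    reference: previously collected activations used to estimate the
               thresholds (an assumption of this sketch).
    p:         percentile, in percent, cut from each tail (an assumption).
    """
    low = np.percentile(reference, p)
    high = np.percentile(reference, 100.0 - p)
    # Values inside [low, high] pass through unchanged; extremes are truncated.
    return np.clip(hidden, low, high)


# Toy usage: a few extreme activations are pulled back into the typical range.
rng = np.random.default_rng(0)
reference = rng.normal(size=10_000)
hidden = np.array([-8.0, 0.1, 0.5, 12.0])
print(clip_features(hidden, reference))
```

In a real model, the clamp would be applied to the hidden features before the output head at test time. This sketch only shows the truncation itself.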