Do LLMs Know about Hallucination? An Empirical Investigation of LLM's Hidden States

Hanyu Duan, Yi Yang, Kar Yan Tam

TL;DR

The paper investigates whether LLMs are aware of hallucination by comparing their hidden-state representations when given correct versus hallucinated answers. It introduces a dual-input framework and analyzes three final hidden states ($\bm{s_1}$, $\bm{s_2}$, $\bm{s_3}$) across LLaMA-2 models on TruthfulQA and HaluEval, deriving a quantitative awareness score from cosine similarities and PCA-derived truthfulness directions. The results show positive awareness and reveal how prompting, external knowledge, and middle-layer information influence detection. A case study further shows that guidance from the correct-transition direction can mitigate hallucinations. These findings offer a mechanism-level understanding of hallucination and suggest a practical route for reducing it in critical applications: activation engineering guided by hidden-space signals. The work thus links internal representations to truthfulness and mitigation outcomes, advancing interpretability and safety in large language models.

Abstract

Large Language Models (LLMs) can make up answers that are not real, and this is known as hallucination. This research aims to see if, how, and to what extent LLMs are aware of hallucination. More specifically, we check whether and how an LLM reacts differently in its hidden states when it answers a question correctly versus when it hallucinates. To do this, we introduce an experimental framework that allows us to examine an LLM's hidden states in different hallucination situations. Building upon this framework, we conduct a series of experiments with language models in the LLaMA family (Touvron et al., 2023). Our empirical findings suggest that LLMs react differently when processing a genuine response versus a fabricated one. We then apply various model interpretation techniques to help understand and explain the findings better. Moreover, informed by the empirical observations, we show the great potential of using guidance derived from an LLM's hidden representation space to mitigate hallucination. We believe this work provides insights into how LLMs produce hallucinated answers and how to make them occur less often.
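
The idea of comparing hidden states for a correct versus a hallucinated answer can be illustrated with a small sketch. Everything here is an assumption made for illustration: the roles given to the three states (`s1` for the question alone, `s2` for the question plus a hallucinated answer, `s3` for the question plus a correct answer), the toy vectors, and the hypothetical function `awareness_score_sketch`. The score shown is one plausible cosine-similarity contrast, not necessarily the definition used in the paper.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def awareness_score_sketch(s1, s2, s3):
    """Hypothetical contrast score (an assumption, not the paper's formula).

    Positive when the question state s1 is closer to the correct-answer
    state s3 than to the hallucinated-answer state s2.
    """
    return cosine(s1, s3) - cosine(s1, s2)


# Toy stand-ins for final-layer hidden states; real states would be taken
# from a model and have thousands of dimensions.
s1 = np.array([1.0, 0.0, 0.5])  # assumed: question only
s2 = np.array([0.0, 1.0, 0.2])  # assumed: question + hallucinated answer
s3 = np.array([0.9, 0.1, 0.6])  # assumed: question + correct answer

score = awareness_score_sketch(s1, s2, s3)
print(f"sketch awareness score: {score:.3f}")
```

With real models, the vectors would be replaced by hidden states extracted at the last token of each input. A distribution of such scores over a dataset is the kind of quantity the awareness-score figures summarize.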

Paper Structure

This paper contains 12 sections, 12 figures, and 13 tables.

Figures (12)

  • Figure 1: Experimental framework, including two independent inputs (i.e., hallucinated input and correct input) and three critical hidden states ($\bm{s_1}$, $\bm{s_2}$, and $\bm{s_3}$). <Question> and <Answer> are templates that are adaptable and can be customized to suit various tasks.
  • Figure 2: Awareness score distributions. We mark the average score for each model in red.
  • Figure 3: Awareness score distributions across hallucination types (LLaMA-2 7B).
  • Figure 4: Awareness score distributions across prompting strategies (LLaMA-2 7B).
  • Figure 5: Awareness score distributions with and without reference knowledge provided.
  • ...and 7 more figures