Do LLMs Know about Hallucination? An Empirical Investigation of LLM's Hidden States
Hanyu Duan, Yi Yang, Kar Yan Tam
TL;DR
The paper investigates whether LLMs are aware of hallucination by examining their hidden-state representations when they process correct versus hallucinated answers. It introduces a dual-input framework and analyzes three final hidden states (s1, s2, s3) across LLaMA-2 models on TruthfulQA and HaluEval, deriving a quantitative awareness score from cosine similarities and PCA-derived truthfulness directions. The results show positive awareness and reveal how prompting, external knowledge, and middle-layer information influence detection. A case study further demonstrates that guidance from the correct-transition direction can mitigate hallucinations. These findings offer a mechanism-level understanding of hallucinations and propose a practical route (activation-engineering-guided hidden-space signals) for reducing them in critical applications. The work thus advances interpretability and safety in large language models by linking internal representations to truthfulness and mitigation outcomes.
Abstract
Large Language Models (LLMs) can make up answers that are not real, and this is known as hallucination. This research aims to see if, how, and to what extent LLMs are aware of hallucination. More specifically, we check whether and how an LLM reacts differently in its hidden states when it answers a question right versus when it hallucinates. To do this, we introduce an experimental framework that allows examining an LLM's hidden states in different hallucination situations. Building upon this framework, we conduct a series of experiments with language models in the LLaMA family (Touvron et al., 2023). Our empirical findings suggest that LLMs react differently when processing a genuine response versus a fabricated one. We then apply various model interpretation techniques to help understand and explain the findings better. Moreover, informed by the empirical observations, we show the great potential of using the guidance derived from an LLM's hidden representation space to mitigate hallucination. We believe this work provides insights into how LLMs produce hallucinated answers and how to make them occur less often.
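To make the idea of a cosine-similarity-based awareness score concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes, for illustration only, that the three hidden states correspond to the question alone, the question followed by a correct answer, and the question followed by a hallucinated answer, and that the score is the difference of two cosine similarities. The function names and the toy vectors are hypothetical; in practice the vectors would be final hidden states extracted from the model.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def awareness_score(s_question: np.ndarray,
                    s_correct: np.ndarray,
                    s_hallucinated: np.ndarray) -> float:
    """Assumed form of the score (not confirmed by the text above):
    positive when the question-only state is closer to the state seen
    with a correct answer than to the state seen with a hallucinated one.
    """
    return cosine(s_question, s_correct) - cosine(s_question, s_hallucinated)


# Toy 2-D vectors standing in for real hidden states (hypothetical data).
s_question = np.array([1.0, 0.0])
s_correct = np.array([1.0, 0.1])       # nearly aligned with the question state
s_hallucinated = np.array([0.0, 1.0])  # orthogonal to the question state

score = awareness_score(s_question, s_correct, s_hallucinated)
print(f"awareness score: {score:.3f}")
```

Under this assumed definition, a positive score means the model's internal state distinguishes the correct answer from the fabricated one in the expected direction; averaging the score over a dataset would give a single summary number.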
