In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation

Shiqi Chen, Miao Xiong, Junteng Liu, Zhengxuan Wu, Teng Xiao, Siyang Gao, Junxian He

TL;DR

This work investigates hallucination in LLMs through the lens of inner representations, identifying in-context sharpness as a reliable signal for factuality. It introduces an entropy-based contextual sharpness metric and a constrained decoding method, Activation Decoding, which biases next-token predictions toward tokens with sharper in-context activations. Empirical results across TruthfulQA, TriviaQA, HotpotQA, and Natural Questions show that contextual entropy distinguishes true from false outputs (AUROC > 0.75) and that Activation Decoding yields consistent factuality gains across model sizes, with practical inference-time optimizations. The approach highlights a practical pathway for mitigating model-related hallucinations and enhances understanding of how hidden states encode factual knowledge, while acknowledging inherent trade-offs and scope limitations.

Abstract

Large language models (LLMs) frequently hallucinate and produce factual errors, yet our understanding of why they make these errors remains limited. In this study, we delve into the underlying mechanisms of LLM hallucinations from the perspective of inner representations, and discover a salient pattern associated with hallucinations: correct generations tend to have sharper context activations in the hidden states of the in-context tokens, compared to the incorrect ones. Leveraging this insight, we propose an entropy-based metric to quantify the "sharpness" among the in-context hidden states and incorporate it into the decoding process to formulate a constrained decoding approach. Experiments on various knowledge-seeking and hallucination benchmarks demonstrate our approach's consistent effectiveness, for example, achieving up to an 8.6 point improvement on TruthfulQA. We believe this study can improve our understanding of hallucinations and serve as a practical solution for hallucination mitigation.

Paper Structure (38 sections, 5 equations, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: Visualization of why in-context activation can serve as an alert signal for factuality. For a given question (e.g., "Fabrizio Spada passed away in __"), we visualize the activation of the true and false tokens across transformer layers. Left: we use the ground truth and false answers from CounterFact (e.g., "Rome" or "Manila") as the target tokens. In this example, the model generates the correct answer. Right: we use the ground truth answer and the model's generated false answer. We then calculate the activation entropy across intermediate layers, focusing on the entropy value at the 26th layer (detailed calculation in Section \ref{sec:finding2}). This entropy value is annotated in the figure. Our findings reveal that incorrect tokens generally exhibit higher entropy than correct ones.
  • Figure 2: Entropy distribution for ground truth and false answers in the GF-CFT dataset, computed using hidden states after the 28th and 26th layers.
  • Figure 3: AUROC scores on GF-CFT and Raw-CFT for our method and the baselines. Our logit+entropy variant performs best at separating correct from incorrect predictions.
  • Figure 4: Overview of our Activation Decoding method. Given the prompt, direct decoding (i.e., greedy decoding) generates the wrong answer 'Hearing'. Here we show how our method encourages the correct answer 'Smell' to be decoded instead. Taking the correct token 'Smell' as an example: 1) We first calculate its activation score at each in-context token using Eq. \ref{eq:agreement_score}. Note that it is strongly activated at the in-context token 'Hyposmia'. 2) We then aggregate these activation scores and normalize them into a distribution using Eq. \ref{eq:agreement_probability} to measure the in-context sharpness. The correct token, with its strong activation at a single in-context token, yields a sharper distribution. 3) We use contextual entropy (Eq. \ref{eq:entropy}) to quantify the sharpness. 4) This entropy is then used as a penalty term to adjust the original token likelihood distribution, boosting the correct token's probability of being decoded.
  • Figure 5: Representative examples demonstrating our improvements in output quality. Compared to the 'base' (greedy decoding), our approach enhances model informativeness (Q1), recognizes biased assumptions, and provides objective responses (Q2). Compared to DoLa, the outputs of our method are more factual (Q3) and contain less of the misinformation that is commonly repeated (Q4).
  • ...and 2 more figures
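
The four steps in the Figure 4 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the equations themselves are not reproduced in this section, so the sketch assumes that the activation score is the probability a vocabulary projection assigns to a candidate token at each in-context position, and that the entropy penalty is subtracted from the log-probabilities with a weight `alpha`. The names `contextual_entropy`, `activation_decoding_step`, `unembed`, and `alpha` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def contextual_entropy(context_hidden, unembed):
    """Entropy of each candidate token's activation over in-context positions.

    context_hidden: (n_ctx, d) hidden states of the in-context tokens
                    at one chosen intermediate layer.
    unembed:        (d, vocab) vocabulary projection (assumed here).
    Returns an array of shape (vocab,); lower entropy means sharper activation.
    """
    # Step 1: activation score of every candidate token at every position.
    act = softmax(context_hidden @ unembed, axis=-1)      # (n_ctx, vocab)
    # Step 2: normalize each token's scores over the in-context positions.
    dist = act / act.sum(axis=0, keepdims=True)           # (n_ctx, vocab)
    # Step 3: contextual entropy of that distribution.
    return -(dist * np.log(dist + 1e-12)).sum(axis=0)     # (vocab,)


def activation_decoding_step(next_logits, context_hidden, unembed, alpha=1.0):
    """Step 4: penalize high-entropy tokens in the next-token distribution.

    The additive log-space penalty and `alpha` are assumptions of this sketch.
    """
    log_probs = np.log(softmax(next_logits))
    entropy = contextual_entropy(context_hidden, unembed)
    adjusted = softmax(log_probs - alpha * entropy)
    return adjusted, entropy


# Toy setup: 3 candidate tokens, 4 in-context positions, identity projection.
# Token 1 is activated strongly at position 1 only (a sharp pattern), while
# token 0 is activated about equally at the other positions (a flat pattern).
unembed = np.eye(3)
context_hidden = np.array([
    [2.0, -4.0, 0.0],
    [2.0,  8.0, 0.0],
    [2.0, -4.0, 0.0],
    [2.0, -4.0, 0.0],
])
next_logits = np.array([1.2, 1.0, -2.0])  # greedy decoding would pick token 0

adjusted, entropy = activation_decoding_step(next_logits, context_hidden, unembed)
print("greedy choice:  ", int(np.argmax(next_logits)))
print("adjusted choice:", int(np.argmax(adjusted)))
print("entropy per token:", np.round(entropy, 3))
```

In this toy case the slightly less likely token with the sharp in-context activation overtakes the greedy choice once the entropy penalty is applied, which mirrors the 'Hearing' versus 'Smell' example in the caption. The numbers are synthetic and only illustrate the mechanism.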