Prompt-Guided Internal States for Hallucination Detection of Large Language Models

Fujie Zhang, Peiqi Yu, Biao Yi, Baolei Zhang, Tong Li, Zheli Liu

TL;DR

This work addresses the limited cross-domain generalization of supervised LLM hallucination detectors. It introduces PRISM, a prompt-guided internal-states framework that steers LLM representations toward a truthfulness-related structure, quantified by metrics such as the variance ratio $R = V_\theta / V_T$ and directional consistency. By generating and selecting effective prompts, PRISM yields internal-state features that, when integrated with existing detectors, substantially improve cross-domain performance on the True-False and LogicStruct datasets and show gains on TruthfulQA. The results support the practical value of prompt-induced internal representations for building more generalizable hallucination detectors, with ablations examining prompt design, layer choice, and real-world applicability.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of tasks in different domains. However, they sometimes generate responses that are logically coherent but factually incorrect or misleading, a phenomenon known as LLM hallucination. Data-driven supervised methods train hallucination detectors on the internal states of LLMs, but detectors trained on specific domains often struggle to generalize to other domains. In this paper, we aim to enhance the cross-domain performance of supervised detectors using only in-domain data. We propose a novel framework, prompt-guided internal states for hallucination detection of LLMs, namely PRISM. By using appropriate prompts to guide changes in the structure related to text truthfulness in LLMs' internal states, we make this structure more salient and more consistent across texts from different domains. We integrate our framework with existing hallucination detection methods and conduct experiments on datasets from different domains. The experimental results indicate that our framework significantly enhances the cross-domain generalization of existing hallucination detection methods.

Paper Structure

This paper contains 22 sections, 8 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Visualization of the LLaMA2-7B-Chat model's internal states on the True-False dataset, projected to two dimensions with PCA, before and after the introduction of Prompt 1. Label 0 represents false statements and label 1 represents true statements. A logistic regression model was fitted to distinguish between true and false statements, with its decision boundary shown as a black dashed line.
  • Figure 2: Comparison of the variance ratios before and after the introduction of Prompt 1 on each sub-dataset of the True-False dataset.
  • Figure 3: Comparison of the variance ratios before and after the introduction of Prompt 3 on each sub-dataset of the True-False dataset.
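
The procedure described in the Figure 1 caption (project internal states to two dimensions with PCA, then fit a logistic regression to separate true from false statements) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: synthetic Gaussian clusters stand in for LLaMA2-7B-Chat internal states, the helper names `pca_2d` and `fit_logistic` are hypothetical, and the dimension, sample count, and learning rate are arbitrary choices.

```python
import numpy as np


def pca_2d(states):
    """Project row vectors onto their top two principal components."""
    centered = states - states.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def fit_logistic(points, labels, lr=0.5, steps=2000):
    """Fit w, b for p(label=1) = sigmoid(points @ w + b) by gradient descent."""
    w = np.zeros(points.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(points @ w + b)))
        error = probs - labels
        w -= lr * points.T @ error / len(labels)
        b -= lr * error.mean()
    return w, b


# ASSUMPTION: synthetic stand-in for internal states. Two Gaussian clusters
# in 64 dimensions, shifted in opposite directions along one random unit axis.
rng = np.random.default_rng(0)
dim, n = 64, 200
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
labels = np.repeat([0, 1], n)  # 0 = false statement, 1 = true statement
shift = np.where(labels[:, None] == 1, 3.0, -3.0) * direction
states = rng.normal(size=(2 * n, dim)) + shift

points = pca_2d(states)
w, b = fit_logistic(points, labels.astype(float))

# The decision boundary is the line w[0] * x + w[1] * y + b = 0 in PCA space.
accuracy = ((points @ w + b > 0).astype(int) == labels).mean()
print(f"training accuracy in the 2-D PCA space: {accuracy:.3f}")
```

With real data, `states` would be replaced by one hidden-state vector per statement, and the same projection could be computed with and without a prompt to compare how cleanly the two labels separate.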