Learning to Trust Your Feelings: Leveraging Self-awareness in LLMs for Hallucination Mitigation

Yuxin Liang; Zhuoyang Song; Hao Wang; Jiaxing Zhang

TL;DR

The paper investigates whether LLMs are aware of their internal knowledge state and can express it, with the aim of mitigating factual hallucinations. It develops two core ideas: (i) knowledge state probing from external and internal perspectives, which shows high linear-probe accuracy on internal representations, and (ii) DreamCatcher, a tool that fuses probing and consistency signals to label data according to the factual-preference hierarchy factuality > uncertainty > hallucination. Building on this, RLKF (Reinforcement Learning from Knowledge Feedback) trains a reward model from DreamCatcher labels and optimizes the base model with PPO to improve factuality and honesty, achieving gains across knowledge and reasoning tasks and reducing the alignment tax typically seen with RLHF. The results indicate that leveraging the internal knowledge state can reduce hallucinations and improve reliability, with DreamCatcher reaching about 81% agreement with human judgments and RLKF delivering measurable improvements on benchmarks such as MMLU, GSM8K, MATH, and TruthfulQA. Overall, the work offers a practical framework for enhancing LLM trustworthiness by aligning generation with the model's internal knowledge, potentially reducing the need for external retrieval systems in many settings.
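
The preference hierarchy above can be illustrated with a minimal sketch of how labeled responses might be turned into (chosen, rejected) pairs for reward-model training. This summary does not specify how DreamCatcher or RLKF build pairs, so the label strings, the `PREFERENCE_RANK` table, and the `build_preference_pairs` helper are all hypothetical names introduced only for illustration.

```python
from itertools import combinations

# Assumed ranking, taken from the hierarchy stated in the summary:
# factuality > uncertainty > hallucination. The label strings are hypothetical.
PREFERENCE_RANK = {"factuality": 2, "uncertainty": 1, "hallucination": 0}


def build_preference_pairs(labeled_responses):
    """Turn (response, label) items for one question into (chosen, rejected) pairs.

    Responses with the same label carry no preference signal and are skipped.
    """
    pairs = []
    for (resp_a, label_a), (resp_b, label_b) in combinations(labeled_responses, 2):
        rank_a = PREFERENCE_RANK[label_a]
        rank_b = PREFERENCE_RANK[label_b]
        if rank_a > rank_b:
            pairs.append((resp_a, resp_b))
        elif rank_b > rank_a:
            pairs.append((resp_b, resp_a))
    return pairs


# Made-up example responses to a single question, one per label.
example = [
    ("Paris is the capital of France.", "factuality"),
    ("I am not sure which city that is.", "uncertainty"),
    ("Lyon is the capital of France.", "hallucination"),
]
pairs = build_preference_pairs(example)
```

Each resulting pair places the higher-ranked response first, which is the shape a pairwise reward-model loss usually expects. Whether the paper uses pairwise comparisons or a full ranking is not stated here.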

Abstract

We evaluate the ability of Large Language Models (LLMs) to discern and express their internal knowledge state, a key factor in countering factual hallucination and ensuring reliable application of LLMs. We observe a robust self-awareness of internal knowledge state in LLMs, evidenced by over 85% accuracy in knowledge probing. However, LLMs often fail to express their internal knowledge during generation, leading to factual hallucinations. We develop an automated hallucination annotation tool, DreamCatcher, which merges knowledge probing and consistency checking methods to rank factual preference data. Using knowledge preference as the reward, we propose a Reinforcement Learning from Knowledge Feedback (RLKF) training framework, leveraging reinforcement learning to enhance the factuality and honesty of LLMs. Our experiments across multiple models show that RLKF training effectively enhances the ability of models to utilize their internal knowledge state, boosting performance in a variety of knowledge-based and honesty-related tasks.

Paper Structure (18 sections, 1 equation, 4 figures, 7 tables)

Introduction
Problem Setup
Knowledge State Probing
External perspective
Internal perspective
DreamCatcher
Method
Experiments
Data collection
RLKF Training
Related Work
Conclusion
Appendix
Example of wiki-QA Instruction
More probing results
...and 3 more sections

Figures (4)

Figure 1: Internal knowledge state categorization of LLM, based on the possession of corresponding internal knowledge and the capacity to express it honestly.
Figure 2: Accuracy of knowledge state probing across different models with different internal representations. The light-colored area in the figure shows the range of accuracy for ten repetitions of the experiment, and the solid line shows the mean accuracy. More results are shown in the appendix section "More probing results".
Figure 3: RLKF training
Figure 4: Accuracy of knowledge state probing in 7B models. The light-colored area in the figure shows the range of accuracy for ten repetitions of the experiment, and the solid line shows the mean accuracy.
