On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation

Xueru Wen; Jie Lou; Xinyu Lu; Ji Yuqiu; Xinyan Guan; Yaojie Lu; Hongyu Lin; Ben He; Xianpei Han; Debing Zhang; Le Sun

TL;DR

This paper tackles hallucination in large language models by proposing RLFH, an on-policy self-alignment framework that lets a model explore its own knowledge boundaries and correct its generation behavior. It introduces a self-assessment pipeline that decomposes responses into atomic statements, verifies them against external sources, and assigns token-level dense rewards for online reinforcement learning with PPO. Key contributions include statement-level evaluation of truthfulness and informativeness, a dense reward conversion based on an LCS mapping, and evaluations on HotpotQA, SQuADv2, and Biography that show improved factuality and generalization to OOD prompts. The results suggest that fine-grained, on-policy feedback can produce more reliable, information-rich outputs with fewer hallucinations, with the caveats that domain coverage may be limited and that automated fact-checking is not fully reliable.

Abstract

Hallucination occurs when large language models exhibit behavior that deviates from the boundaries of their knowledge during response generation. To address this critical issue, previous learning-based methods attempt to finetune models but are limited by off-policy sampling and coarse-grained feedback. In this paper, we present Reinforcement Learning for Hallucination (RLFH), an on-policy self-alignment approach that enables LLMs to actively explore their knowledge boundaries and self-correct generation behavior through fine-grained feedback signals. RLFH introduces a self-assessment framework where the policy serves as its own judge. Through this framework, responses are automatically decomposed into atomic facts, and their truthfulness and informativeness are assessed against external knowledge sources. The resulting fine-grained feedback at the statement level is then converted into token-level dense reward signals. This enables online reinforcement learning to achieve precise and timely optimization without human intervention. Comprehensive evaluations on HotpotQA, SQuADv2, and Biography benchmarks validate RLFH's effectiveness in hallucination mitigation.
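
The conversion from statement-level feedback to token-level rewards can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `lcs_alignment` and `dense_rewards`, whitespace tokenization, the numeric scores, and the choice to split each statement's score evenly over its matched tokens are all assumptions made for illustration. The sketch only shows how a Longest Common Subsequence (LCS) alignment can locate the response tokens that a statement was extracted from.

```python
from typing import List, Sequence, Tuple


def lcs_alignment(response: Sequence[str], statement: Sequence[str]) -> List[int]:
    """Return indices of response tokens that form one LCS with the statement."""
    n, m = len(response), len(statement)
    # dp[i][j] holds the LCS length of response[i:] and statement[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if response[i] == statement[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover the matched response indices.
    matched: List[int] = []
    i = j = 0
    while i < n and j < m:
        if response[i] == statement[j]:
            matched.append(i)
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


def dense_rewards(
    response: Sequence[str],
    feedback: Sequence[Tuple[Sequence[str], float]],
) -> List[float]:
    """Spread each statement's score over the response tokens it aligns to.

    Assumption (hypothetical scheme): the score is split evenly across the
    matched tokens; tokens matched by no statement keep a reward of 0.
    """
    rewards = [0.0] * len(response)
    for statement, score in feedback:
        indices = lcs_alignment(response, statement)
        if not indices:
            continue
        share = score / len(indices)
        for k in indices:
            rewards[k] += share
    return rewards
```

For example, given the response "Paris is the capital of France and has 90 million residents" and two extracted statements, one judged correct (score 1.0) and one judged wrong (score -1.0), `dense_rewards` places positive reward on the tokens of the first clause and negative reward on the tokens of the second, leaving the connective "and" at zero. An even split is only one possible design; a real system could instead concentrate the reward on the final matched token or weight by informativeness.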

Paper Structure (35 sections, 5 equations, 10 figures, 5 tables)

Introduction
Related Works
Hallucination Mitigation
Reinforcement Learning from Human Feedback
Reinforcement Learning for Hallucination
Fine-grained Feedback from Policy as the Judge
Statement Extraction
Factual Verification
Informativeness Assessment
On-Policy Optimization with Token-level Reward
Dense Reward Conversion
Truthfulness
Informativeness
Online Reinforcement Learning
Experiment
...and 20 more sections

Figures (10)

Figure 1: The figure illustrates the hallucinatory case and several hallucination mitigation methodologies. The factual information within the text is underlined. False content is highlighted in red, whereas accurate facts are indicated in blue. Statements with uncertain veracity are marked in orange.
Figure 2: A diagram illustrating the steps of our algorithm: (1) Sampling response from tuning model, (2) Policy acting as a judge model performing self-assessment to collect fine-grained knowledge feedback, and (3) Converting the language-form feedback into token-level dense reward for reinforcement learning.
Figure 3: A schematic of the fine-grained feedback and token-level reward strategy. First, statements are extracted hierarchically. Next, the veracity and utility of each statement are assessed. Finally, the structured feedback is mapped back into a dense reward via the Longest Common Subsequence (LCS) algorithm.
Figure 4: Distribution of statement accuracy versus count per response for Qwen2.5-7B-Instruct, comparing the base model and RLFH-tuned model.
Figure 5: Distribution of statements per response across different truthfulness categories, comparing base Qwen2.5-7B-Instruct and its RLFH-tuned version. The distributions are normalized due to the filtering of rejected responses.
...and 5 more figures
