Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

Shuo Nie; Hexuan Deng; Chao Wang; Ruiyu Fang; Xuebo Liu; Shuangyong Song; Yu Li; Min Zhang; Xuelong Li

TL;DR

This work tackles faithfulness hallucinations in the chain-of-thought (CoT) reasoning of small reasoning models (SRMs), showing that outcome-based rewards fail to penalize unfaithful intermediate steps. It proposes FaithRL, a reinforcement learning framework that combines implicit step-level rewards via Dynamic Truncated Resampling (DTR) with explicit step-level rewards from a Process Reward Model (PRM), penalizing unfaithful CoT steps while rewarding faithful prefixes. Across multiple SRMs and Open-Book QA benchmarks, FaithRL consistently reduces faithfulness hallucinations in both the CoT and final answers, outperforming baselines while maintaining training efficiency. By jointly optimizing task accuracy and reasoning faithfulness, the approach offers a practical path toward more trustworthy SRMs. Code is available at https://github.com/Easy195/FaithRL.

Abstract

As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.
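The truncation step behind the implicit reward can be illustrated with a minimal sketch. It assumes the CoT has already been split into sentences and that some sentence-level faithfulness detector is available. The names `faithful_prefix` and `is_faithful` are hypothetical and are not taken from the FaithRL repository, and the resampling call itself (continuing generation from the prefix with the policy model) is left out.

```python
from typing import Callable, List, Optional, Tuple


def faithful_prefix(
    sentences: List[str],
    is_faithful: Callable[[str], bool],
) -> Tuple[List[str], Optional[int]]:
    """Cut a CoT at its first unfaithful sentence.

    Returns the faithful sentences that precede the first detected
    hallucination, plus the index of that hallucinated sentence
    (None if every sentence passes the detector).
    `is_faithful` is a stand-in for a real detector (assumption).
    """
    for index, sentence in enumerate(sentences):
        if not is_faithful(sentence):
            # Keep only the sentences before the hallucinated one.
            return sentences[:index], index
    return list(sentences), None


# Toy usage: a detector that flags any sentence containing "UNSUPPORTED".
cot = [
    "The article says the meeting was on Monday.",
    "UNSUPPORTED: the meeting was held in Paris.",
    "So the answer is Monday.",
]
prefix, cut = faithful_prefix(cot, lambda s: "UNSUPPORTED" not in s)
```

In this toy run, `prefix` holds only the first sentence and `cut` is 1. In a full pipeline the prefix would be fed back to the policy model to sample new continuations, which is what yields the contrastive signal described in the abstract.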

Paper Structure (53 sections, 7 equations, 3 figures, 8 tables)

Introduction
Related Work
Hallucinations in LLMs.
Mitigating Factual Hallucination.
Mitigating Faithfulness Hallucination.
Analyzing Faithfulness Hallucinations in SRMs
Preliminaries
Faithfulness Hallucinations in Open-Book Question Answering.
Datasets and Metrics.
Experimental Results and Analysis
Severe Faithfulness Hallucinations in SRMs.
More Severe CoT Faithfulness Hallucinations than Answer in SRMs.
Faithfulness Hallucinations within the CoT Result in Incorrect Answers.
Our Proposed Method
Implicit Step-Level Rewards with Dynamic Truncated Resampling
...and 38 more sections

Figures (3)

Figure 1: For small reasoning models, faithfulness hallucinations in the CoT and the final answer are often inconsistent, making outcome-only rewards ineffective. The figure shows an example on NewsQA with DeepSeek-R1-Distill-Qwen-1.5B (DPSK-1.5B): although the final answer is correct, faithfulness hallucinations still occur within the CoT. By constructing new questions based on hallucinated CoT steps, we find that the model frequently produces incorrect answers, indicating that faithfulness hallucinations in the CoT can trigger further errors.
Figure 2: An overview of the FaithRL framework. The left part illustrates Implicit Step-Level Rewards with Dynamic Truncated Resampling. Sentence-level faithfulness detection is performed on the CoT generated by the policy model. When a faithfulness hallucination is detected, the reasoning process is truncated, and the faithful CoT sentences preceding it are used as a prefix for resampling. The right part depicts Explicit Step-Level Rewards with PRM. When the final answer is incorrect, $R_{\text{answer}}$ is uniformly assigned to all tokens in the CoT and final answer. When the answer is correct, step-level rewards are applied to individual CoT sentences, with each sentence-level reward propagated to its tokens. This design enables precise penalization of faithfulness hallucinations while preserving correct and faithful reasoning steps.
Figure 3: Curves of Key CoT Faithful Rate and Answer Faithful Rate during training on DPSK-1.5B evaluated on NewsQA.
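
The token-level reward assignment described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `assign_token_rewards`, the per-sentence token counts, and the numeric reward values are hypothetical. The caption also does not say what the answer tokens receive when the answer is correct, so the sketch assumes they receive $R_{\text{answer}}$.

```python
from typing import List


def assign_token_rewards(
    sentence_token_counts: List[int],
    sentence_rewards: List[float],
    answer_token_count: int,
    answer_correct: bool,
    r_answer: float,
) -> List[float]:
    """Return one reward per token: CoT tokens first, then answer tokens."""
    if len(sentence_token_counts) != len(sentence_rewards):
        raise ValueError("need exactly one reward per CoT sentence")

    total_cot_tokens = sum(sentence_token_counts)

    if not answer_correct:
        # Incorrect answer: R_answer is assigned uniformly to every token
        # in the CoT and the final answer.
        return [r_answer] * (total_cot_tokens + answer_token_count)

    # Correct answer: each sentence-level reward is propagated to the
    # tokens of that sentence.
    rewards: List[float] = []
    for count, reward in zip(sentence_token_counts, sentence_rewards):
        rewards.extend([reward] * count)

    # Assumption: answer tokens keep R_answer when the answer is correct.
    rewards.extend([r_answer] * answer_token_count)
    return rewards


# Two CoT sentences (2 and 3 tokens) and a 2-token answer.
correct_case = assign_token_rewards([2, 3], [1.0, -1.0], 2, True, 1.0)
incorrect_case = assign_token_rewards([2, 3], [1.0, -1.0], 2, False, -1.0)
```

With these illustrative values, the second sentence is penalized token by token even though the answer is correct, while an incorrect answer applies the same $R_{\text{answer}}$ to all seven tokens.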
