Reasoning Models Hallucinate More: Factuality-Aware Reinforcement Learning for Large Reasoning Models
Junyi Li, Hwee Tou Ng
TL;DR
This work identifies that RL fine-tuning for large reasoning models exacerbates hallucinations due to high-variance policy gradients, entropy-driven exploration, and spurious local optima under outcome-based rewards. To counter this, it introduces Factuality-aware Step-wise Policy Optimization (FSPO), which injects step-wise factual verification into the RL loop to adjust token-level advantages toward factually grounded reasoning. FSPO combines an entailment-based step-wise factuality reward with the final answer reward and reweights token advantages per step, enabling denser and more informative feedback than final-outcome signals. Empirical results across mathematical reasoning and hallucination benchmarks on Qwen2.5 and Llama models show that FSPO reduces hallucinations while improving reasoning accuracy, outperforming several open-source baselines and demonstrating robustness across tasks. The approach advances the reliability of RL-tuned reasoning systems by explicitly prioritizing verifiable reasoning steps over merely correct final answers.
Abstract
Large language models (LLMs) have advanced significantly on reasoning tasks through reinforcement learning (RL) optimization, achieving impressive capabilities across various challenging benchmarks. However, our empirical analysis reveals a critical drawback: reasoning-oriented RL fine-tuning significantly increases the prevalence of hallucinations. We theoretically analyze the RL training dynamics, identifying high-variance gradients, entropy-induced randomness, and susceptibility to spurious local optima as key factors leading to hallucinations. To address this drawback, we propose Factuality-aware Step-wise Policy Optimization (FSPO), an innovative RL fine-tuning algorithm that incorporates explicit factuality verification at each reasoning step. FSPO leverages automated verification against given evidence to dynamically adjust token-level advantage values, incentivizing factual correctness throughout the reasoning process. Experiments across mathematical reasoning and hallucination benchmarks using Qwen2.5 and Llama models demonstrate that FSPO effectively reduces hallucinations while enhancing reasoning accuracy, substantially improving both reliability and performance.
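The idea of combining a step-wise factuality signal with the final answer reward, then adjusting token-level advantages per step, can be illustrated with a minimal Python sketch. It is not the paper's exact formulation. The following are all assumptions made for illustration: the function names, the `weight` and `scale` parameters, the label encoding (+1 for a step entailed by the evidence, 0 for neutral, -1 for contradicted), the group-normalized advantage, and the additive adjustment rule. The step labels are assumed to come from some external entailment-based verifier, which is not shown.

```python
from statistics import mean, pstdev

def combined_reward(answer_correct, step_labels, weight=0.5):
    """Final-answer reward plus a weighted mean step factuality score.

    step_labels: one value per reasoning step, assumed to be +1 (entailed),
    0 (neutral), or -1 (contradicted). `weight` is a hypothetical knob.
    """
    answer_reward = 1.0 if answer_correct else 0.0
    factuality = mean(step_labels) if step_labels else 0.0
    return answer_reward + weight * factuality

def group_advantages(rewards):
    """Normalize rewards within a group of sampled responses.

    Assumption: a group-relative baseline; the source does not specify one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

def token_advantages(seq_advantage, step_labels, step_lengths, scale=0.5):
    """Spread a sequence-level advantage over tokens, shifted per step.

    Illustrative rule only: tokens in a factual step get a higher advantage,
    tokens in a contradicted step get a lower one.
    """
    advantages = []
    for label, n_tokens in zip(step_labels, step_lengths):
        advantages.extend([seq_advantage + scale * label] * n_tokens)
    return advantages

# Two sampled responses: the first is correct with one contradicted step,
# the second is wrong but fully grounded in the evidence.
rewards = [
    combined_reward(True, [1, 1, -1]),
    combined_reward(False, [1, 1]),
]
seq_adv = group_advantages(rewards)
per_token = token_advantages(seq_adv[0], [1, 1, -1], [4, 3, 5])
```

In this sketch, the five tokens of the contradicted third step receive a lower advantage than the tokens of the two supported steps, even though the response as a whole was rewarded. That is the kind of denser, step-level feedback the text contrasts with a final-outcome signal.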
