Are Reasoning Models More Prone to Hallucination?

Zijun Yao, Yantao Liu, Yanxu Chen, Jianhui Chen, Junfeng Fang, Lei Hou, Juanzi Li, Tat-Seng Chua

TL;DR

This work investigates whether large reasoning models (LRMs) are more prone to hallucination by systematically evaluating LRMs trained with different post-training pipelines on fact-seeking benchmarks. It shows that a complete pipeline of cold-start SFT followed by verifiable-reward RL generally reduces hallucination, whereas distillation alone or RL without cold-start fine-tuning tends to increase factual errors. The authors identify two cognitive failure modes, Flaw Repetition and Think-Answer Mismatch, and show that increased hallucination is associated with misalignment between model uncertainty and factual accuracy, which probing can reveal. By analyzing calibration and uncertainty, the paper offers practical guidance for safer reasoning models and highlights the value of combining SFT with RL. The findings suggest uncertainty-aware monitoring as a promising direction for trustworthy long-CoT systems in real-world use.

Abstract

Recently evolved large reasoning models (LRMs) show powerful performance in solving complex tasks with long chain-of-thought (CoT) reasoning capability. As these LRMs are mostly developed by post-training on formal reasoning tasks, whether they generalize the reasoning capability to help reduce hallucination in fact-seeking tasks remains unclear and debated. For instance, DeepSeek-R1 reports increased performance on SimpleQA, a fact-seeking benchmark, while OpenAI-o3 exhibits even more severe hallucination. This discrepancy naturally raises the following research question: Are reasoning models more prone to hallucination? This paper addresses the question from three perspectives. (1) We first conduct a holistic evaluation of hallucination in LRMs. Our analysis reveals that LRMs that undergo a full post-training pipeline with cold-start supervised fine-tuning (SFT) and verifiable-reward RL generally show reduced hallucination. In contrast, both distillation alone and RL training without cold-start fine-tuning introduce more nuanced hallucinations. (2) To explore why different post-training pipelines alter the impact on hallucination in LRMs, we conduct behavior analysis. We characterize two critical cognitive behaviors that directly affect the factuality of an LRM: Flaw Repetition, where the surface-level reasoning attempts repeatedly follow the same underlying flawed logic, and Think-Answer Mismatch, where the final answer fails to faithfully match the preceding CoT process. (3) Further, we investigate the mechanism behind the hallucination of LRMs from the perspective of model uncertainty. We find that increased hallucination in LRMs is usually associated with a misalignment between model uncertainty and factual accuracy. Our work provides an initial understanding of hallucination in LRMs.

Paper Structure

This paper contains 18 sections, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Comparison of factual accuracy between LRMs and their backbone counterparts, along with illustrative examples of LRM hallucinations. (a) Accuracy comparison on two fact-seeking benchmarks (SimpleQA and TriviaQA), with relative changes denoted on the bars. It shows that LRMs seem to suffer from factuality degradation compared to their backbone models, especially on SimpleQA, except DeepSeek-R1. (b) Examples of hallucinated answers from two LRMs: DeepSeek-Qwen-Distill-32B and Qwen3-32B provide incorrect responses to queries about model information. In reality, Qwen-2 was released in June 2024 with no 3.5 billion variant, and DeepSeek-V3 was released in December 2024 with 671 billion parameters.
  • Figure 2: Calibration plot comparing LRMs with their non-reasoning counterparts on TriviaQA. Each plot visualizes the relationship between model confidence $P(a)$—estimated via sampling and majority voting—and the actual correctness probability $P(c|a)$ judged by an external LLM. Models closer to the diagonal with lower Expected Calibration Error (ECE) are better calibrated.
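
The quantities named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes confidence is the fraction of sampled answers that agree with the majority answer, and that ECE is computed with equal-width confidence bins. The function names, the bin count, and the binning scheme are illustrative assumptions, not details confirmed by the paper; the correctness flags stand in for the external LLM judge's verdicts.

```python
from collections import Counter


def majority_confidence(samples):
    """Return (majority answer, fraction of samples agreeing with it).

    `samples` is a non-empty list of answers sampled for one question.
    Hypothetical helper: the paper's exact estimator may differ.
    """
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE (an assumed, commonly used formulation).

    confidences: per-question confidence values in [0, 1].
    correct: per-question booleans, e.g. verdicts from an external judge.
    Returns the bin-weighted average of |accuracy - mean confidence|.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A model whose points lie on the diagonal of the calibration plot would score an ECE of zero under this definition; larger values indicate that stated confidence and actual correctness diverge.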