Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, Mao Yang

TL;DR

This paper systematically investigates whether Reinforcement Learning with Verifiable Rewards (RLVR) truly enhances LLM reasoning or merely improves sampling efficiency. It introduces CoT-Pass@K to evaluate both final answers and intermediate reasoning, and provides a theoretical GRPO framework showing that correct CoTs become more likely under answer-based rewards when correct-CoT priors exist. The authors demonstrate extended reasoning boundaries in math and code tasks after RLVR and analyze training dynamics, showing early incentives for correct CoTs and generalization to unseen prompts, alongside improvements in CoT quality. They discuss limitations, including verifier costs and potential failure modes, and suggest that RLVR can be a foundation for more robust, verifiable reasoning in LLMs, with implications for live benchmarks and data-efficient learning via supervised fine-tuning. Overall, the work reconciles conflicting findings in prior RLVR studies and illuminates the mechanisms by which RLVR shapes reasoning behavior in LLMs.

Abstract

Recent advancements in long chain-of-thought (CoT) reasoning, particularly through the Group Relative Policy Optimization algorithm used by DeepSeek-R1, have led to significant interest in the potential of Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). While RLVR promises to improve reasoning by allowing models to learn from free exploration, there remains debate over whether it truly enhances reasoning abilities or simply boosts sampling efficiency. This paper systematically investigates the impact of RLVR on LLM reasoning. We revisit Pass@K experiments and demonstrate that RLVR can extend the reasoning boundary for both mathematical and coding tasks. This is supported by our introduction of a novel evaluation metric, CoT-Pass@K, which captures reasoning success by accounting for both the final answer and intermediate reasoning steps. Furthermore, we present a theoretical framework explaining RLVR's incentive mechanism, demonstrating how it can encourage correct reasoning even when rewards are based solely on answer correctness. Our analysis of RLVR's training dynamics reveals that it incentivizes correct reasoning early in the process, with substantial improvements in reasoning quality confirmed through extensive evaluations. These findings provide strong evidence of RLVR's potential to enhance LLM reasoning, offering valuable insights into its mechanisms and performance improvements.
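
The two metrics named in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: it assumes the standard unbiased Pass@K estimator (n samples per prompt, c of them successful) and assumes that CoT-Pass@K reuses the same estimator while counting a sample as successful only when both its final answer and its CoT are judged correct. The function names `pass_at_k` and `cot_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one success among k samples),
    given c successes observed among n samples (assumed estimator)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def cot_pass_at_k(answer_ok: list[bool], cot_ok: list[bool], k: int) -> float:
    """Same estimator, but a sample only counts as a success when both
    the final answer and the chain of thought are marked correct."""
    if len(answer_ok) != len(cot_ok):
        raise ValueError("one CoT verdict is needed per sample")
    n = len(answer_ok)
    c = sum(a and r for a, r in zip(answer_ok, cot_ok))
    return pass_at_k(n, c, k)


if __name__ == "__main__":
    # Four samples: two reach the right answer, but only one of those
    # also has a valid CoT (hypothetical verdicts).
    answers = [True, True, False, False]
    cots = [True, False, False, False]
    print(pass_at_k(len(answers), sum(answers), 2))
    print(cot_pass_at_k(answers, cots, 2))
```

Under these assumptions CoT-Pass@K can never exceed Pass@K for the same samples, since every sample it counts as a success is also a correct-answer sample.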

Paper Structure

This paper contains 40 sections, 1 theorem, 63 equations, 10 figures.

Key Result

Theorem 1

For any prompt $q$ satisfying our assumptions, the expected GRPO advantage $\mathbb{E}[\hat{A}(y_i)]$ is positive for responses with a correct CoT and negative for responses with an incorrect CoT: $\mathbb{E}[\hat{A}(y_i) \mid \text{correct CoT}] > 0$ and $\mathbb{E}[\hat{A}(y_i) \mid \text{incorrect CoT}] < 0$, where $\hat{A}(y_i)$ is the GRPO advantage. The GRPO policy gradient update therefore increases the probability of generating correct CoTs ($p^\theta_c$) in the next round, so $p^\theta_c$ increases monotonically.
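
A small simulation can make the direction of this result concrete. The sketch below is an illustration under stated assumptions, not the paper's proof or training code. It assumes the commonly used group-normalized GRPO advantage (reward minus group mean, divided by group standard deviation, here the population standard deviation), binary answer-only rewards, and, purely for illustration, that a response with a correct CoT reaches the right answer with probability `alpha` while one with an incorrect CoT does so with a smaller probability `beta`. The names `alpha`, `beta`, `grpo_advantages`, and `simulate` are hypothetical; the theorem's actual assumptions are not reproduced in this summary.

```python
import random
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-normalized advantage: (r - mean) / std over one group.
    Groups with identical rewards carry no signal and get zeros."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma < 1e-12:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def simulate(p_correct_cot: float, alpha: float, beta: float,
             group_size: int, n_groups: int, seed: int = 0):
    """Monte Carlo estimate of the mean advantage received by responses
    with correct vs. incorrect CoTs, rewarding only the final answer."""
    rng = random.Random(seed)
    adv_correct, adv_incorrect = [], []
    for _ in range(n_groups):
        # Sample which responses in the group have a correct CoT.
        cot_ok = [rng.random() < p_correct_cot for _ in range(group_size)]
        # Answer-only reward: 1.0 if the final answer is right, else 0.0.
        rewards = [1.0 if rng.random() < (alpha if ok else beta) else 0.0
                   for ok in cot_ok]
        for ok, adv in zip(cot_ok, grpo_advantages(rewards)):
            (adv_correct if ok else adv_incorrect).append(adv)
    return mean(adv_correct), mean(adv_incorrect)


if __name__ == "__main__":
    correct, incorrect = simulate(0.3, alpha=0.9, beta=0.2,
                                  group_size=8, n_groups=2000)
    print(f"mean advantage, correct CoT:   {correct:+.3f}")
    print(f"mean advantage, incorrect CoT: {incorrect:+.3f}")
```

With `alpha` above `beta`, correct-CoT responses receive a positive average advantage even though the reward never inspects the CoT, which is the mechanism the theorem formalizes.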

Figures (10)

  • Figure 1: An illustration of our perspective: RLVR implicitly incentivizes correct reasoning in base LLMs. We visualize how different explanation frameworks lead to varying reasoning paths being activated, with our perspective shown in the lower left and a recent popular hypothesis explaining Pass@K observations (Yue et al., 2025) summarized in the upper left. In this diagram, the line width represents the sampling probability of a reasoning path, while the color distinguishes correct paths (green) from incorrect ones (red). If all reasoning paths after applying RLVR are already present in the base model, the reasoning model merely adjusts the sampling probabilities of these existing paths (visualized in dashed lines). This hypothesis effectively accounts for the key observation shown in the upper-right part, where, for a moderately large $K$, a base LLM can catch up to the reasoning model after RLVR using the Pass@K metric. In this study, we unveil the extended reasoning capability boundary in math tasks using a refined metric, CoT-Pass@K, which emphasizes both the correctness of answers and the validity of reasoning CoTs.
  • Figure 2: Comparisons of Pass@K (the top row) and CoT-Pass@K (the bottom row) on five math benchmarks (different columns) to show how RLVR could improve base LLMs. Here the base LLM is Qwen2.5-32B, and the post-RLVR model is DAPO-Qwen-32B. For CoT-Pass@K, we perform multiple verifications for each CoT using DeepSeek-R1-0528-Qwen3-8B, and display the results determined by any-correct, all-correct, and majority-correct strategies, which constitute the shaded area in lower subplots.
  • Figure 3: Comparisons of Pass@K across six LiveCodeBench versions to show how much RLVR could enhance distilled LLMs. Here the distilled LLM is DeepSeek-R1-Distill-Qwen-7B, and the post-RLVR model is AceReason-Nemotron-7B.
  • Figure 4: The evolution of $P(CA)^{(q)}$ (the fraction of correct answers for prompt $q$) and $P(CC|CA)^{(q)}$ (the fraction of correct CoTs within the correct answers for prompt $q$) for fully optimized training questions over the course of DAPO training.
  • Figure 5: The evolution of Pass@K (the top row) and CoT-Pass@K (the bottom row) performance on AIME 2024 and 2025 for different model checkpoints during the DAPO training.
  • ...and 5 more figures
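
The quantities in the captions for Figures 2 and 4 can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: it assumes each CoT receives several boolean verdicts from a verifier, that "majority-correct" means strictly more than half of the verdicts are positive, and that $P(CA)^{(q)}$ and $P(CC|CA)^{(q)}$ are plain empirical fractions over the samples for one prompt. The function names `aggregate` and `answer_and_cot_rates` are hypothetical.

```python
import math


def aggregate(verdicts: list[bool], strategy: str) -> bool:
    """Combine repeated verifier verdicts for a single CoT."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    if strategy == "any":
        return any(verdicts)
    if strategy == "all":
        return all(verdicts)
    if strategy == "majority":
        # Assumption: a tie does not count as a majority.
        return 2 * sum(verdicts) > len(verdicts)
    raise ValueError(f"unknown strategy: {strategy!r}")


def answer_and_cot_rates(answer_ok: list[bool], cot_ok: list[bool]):
    """Return (P(CA), P(CC|CA)) for one prompt: the fraction of samples
    with a correct answer, and the fraction of those whose CoT is also
    correct. P(CC|CA) is NaN when no sample has a correct answer."""
    if not answer_ok or len(answer_ok) != len(cot_ok):
        raise ValueError("need matching, non-empty verdict lists")
    n_ca = sum(answer_ok)
    n_cc = sum(a and c for a, c in zip(answer_ok, cot_ok))
    p_ca = n_ca / len(answer_ok)
    p_cc_given_ca = n_cc / n_ca if n_ca else math.nan
    return p_ca, p_cc_given_ca


if __name__ == "__main__":
    verdicts = [True, False, False]  # three hypothetical verifier runs
    for strategy in ("any", "all", "majority"):
        print(strategy, aggregate(verdicts, strategy))
    print(answer_and_cot_rates([True, True, False, False],
                               [True, False, False, False]))
```

Because any-correct is the most lenient rule and all-correct the strictest, the two bound the majority-correct result, which is consistent with the three strategies forming a shaded band in the lower subplots of Figure 2.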

Theorems & Definitions (2)

  • Theorem 1: GRPO Implicitly Incentivizes Correct Reasoning
  • Proof 1