The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models

Phuc Minh Nguyen, Chinh D. La, Duy M. H. Nguyen, Nitesh V. Chawla, Binh T. Nguyen, Khoa D. Doan

TL;DR

This work analyzes why Reinforcement Learning with Verifiable Rewards (RLVR) can shrink the reasoning boundary of large language models rather than expand it. It identifies two core dynamics—negative interference across problems and a winner-take-all reinforcement of high-likelihood solutions under on-policy learning—that drive coverage collapse at larger Pass@$k$ budgets. To address this, the authors propose SELF, a data-curation strategy that focuses learning on low-likelihood problems and replaces Reverse KL with Forward KL to maintain diversity, yielding improved Pass@$k$ performance on multiple mathematical reasoning benchmarks. The results demonstrate that selective training can both improve efficiency and recover coverage, offering a practical path to more robust reasoning in LLMs.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving Large Language Models' reasoning capabilities, yet recent evidence suggests it may paradoxically shrink the reasoning boundary rather than expand it. This paper investigates the shrinkage issue of RLVR by analyzing its learning dynamics and reveals two critical phenomena that explain this failure. First, we expose negative interference in RLVR, where learning to solve certain training problems actively reduces the likelihood of correct solutions for others, leading to a decline in Pass@$k$ performance, i.e., the probability of generating a correct solution within $k$ attempts. Second, we uncover the winner-take-all phenomenon: RLVR disproportionately reinforces problems whose correct solutions already have high likelihood under the base model, while suppressing problems whose correct solutions initially have low likelihood. Through extensive theoretical and empirical analysis on multiple mathematical reasoning benchmarks, we show that this effect arises from the inherent on-policy sampling in standard RL objectives, causing the model to converge toward narrow solution strategies. Based on these insights, we propose a simple yet effective data curation algorithm that focuses RLVR learning on low-likelihood problems, achieving notable improvement in Pass@$k$ performance. Our code is available at https://github.com/mail-research/SELF-llm-interference.
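To make the Pass@$k$ metric concrete, the sketch below computes the widely used unbiased estimator $1 - \binom{n-c}{k}/\binom{n}{k}$ from $n$ samples of which $c$ are correct. Whether the authors use exactly this estimator is an assumption, as the text above does not say; the function name `pass_at_k` and the two sets of per-problem counts are hypothetical and chosen only to show how average accuracy (Pass@1) and coverage (Pass@$k$ at large $k$) can move in opposite directions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which are correct.

    Equals the probability that a uniformly random size-k subset of the
    n samples contains at least one correct sample.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average Pass@k over problems, given the correct count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts (not from the paper): 8 samples on each of 2 problems.
NARROW = [8, 0]  # always solves problem 1, never solves problem 2
BROAD = [4, 1]   # less reliable on problem 1, occasionally solves problem 2

for name, counts in (("narrow", NARROW), ("broad", BROAD)):
    print(name, mean_pass_at_k(counts, 8, 1), mean_pass_at_k(counts, 8, 8))
```

In this toy setting the narrow model has the higher Pass@1 but the lower Pass@8, which is the pattern the abstract describes as a shrinking reasoning boundary.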

Paper Structure

This paper contains 36 sections, 3 theorems, 34 equations, 15 figures, and 3 tables.

Key Result

Proposition 3.1

Consider a problem-solution pair $(\boldsymbol{x},\boldsymbol{y})$ used for an update with the policy gradient objective and a sufficiently small learning rate $\eta$. The per-step influence on a problem-solution pair $(\boldsymbol{x}',\boldsymbol{y}')$ is defined as: where $\mathcal{K}^t(\boldsymbol{x},\boldsymbol{x}',\boldsymbol{y},\boldsymbol{y}')=\nabla_\theta\log\pi_{\theta^t}(\boldsymbol{y}|\boldsymbo

Figures (15)

  • Figure 1: Pass@$k$ evolution (smoothed) during RLVR training with Qwen2.5-Math-1.5B, comparing our proposed finetuning objective SELF to GRPO under a large sampling budget $k$. While GRPO shows a progressive decline, SELF exhibits consistent improvements in Pass@$k$ throughout the training process.
  • Figure 2: Learning dynamics of RLVR with key trends: (A) RLVR tends to improve average accuracy, but reduce the coverage of solvable solutions, as measured by Pass@256 (averaged across 4 test benchmarks); (B) the increasing effect of influence strength as training progresses; (C) the increase in negative interference; and (D) the decline of model confidence on previously correct solutions.
  • Figure 3: Interference as an indicator of Pass@$k$ decrease across various benchmarks.
  • Figure 4: Perplexity during RLVR training. Leftmost: the data sampled from each intermediate checkpoint $\pi_{\theta^t}$ exhibits an increasingly high model confidence under $\pi_b$. Middle: RLVR models $\pi_{\theta^t}$ exhibit reduced confidence in data previously generated by the base model, regardless of their correctness. Rightmost: problems that RLVR improves already have a high likelihood of generating correct solutions under $\pi_b$, while coverage-reduced problems initially have a low likelihood of producing correct answers.
  • Figure 5: An illustration of the dynamics of on-policy learning in Eq. \ref{eq:reinforce_grad}. Leftmost: a correct response $\boldsymbol{y}^+$ in a low-likelihood region induces minimal effect. Middle: with multiple correct responses $\boldsymbol{y}_1^+$ and $\boldsymbol{y}_2^+$, updates favor the one with higher initial likelihood. Rightmost: negative gradients on an incorrect response $\boldsymbol{y}^-$ can raise correct ones $\boldsymbol{y}^+$, but greedy responses $\boldsymbol{y}^*$ increase the most.
  • ...and 10 more figures
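
The middle panel of Figure 5 (two correct responses, with updates favoring the initially more likely one) can be reproduced in a toy setting. The sketch below is not the paper's experiment: it assumes a three-response softmax policy with one logit per response, binary rewards, and the exact expected policy gradient $\partial J/\partial z_j = p_j (r_j - J)$ with $J = \sum_i p_i r_i$ in place of sampled updates. The initial probabilities, learning rate, and step count are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_pg_step(logits, rewards, lr):
    """One exact expected policy-gradient ascent step on the logits.

    For J = sum_i p_i * r_i under a softmax policy, dJ/dz_j = p_j * (r_j - J),
    so each logit moves in proportion to its current probability.
    """
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [z + lr * p * (r - expected_reward)
            for z, p, r in zip(logits, probs, rewards)]


def run(initial_probs, rewards, steps=2000, lr=0.5):
    """Iterate the expected update starting from the given probabilities."""
    logits = [math.log(p) for p in initial_probs]
    for _ in range(steps):
        logits = expected_pg_step(logits, rewards, lr)
    return softmax(logits)


# Hypothetical setup: responses are [y1+, y2+, y-]; both y1+ and y2+ are correct.
INITIAL = [0.30, 0.10, 0.60]
REWARDS = [1.0, 1.0, 0.0]
FINAL = run(INITIAL, REWARDS)

print("initial y1+/y2+ ratio:", INITIAL[0] / INITIAL[1])
print("final   y1+/y2+ ratio:", FINAL[0] / FINAL[1])
```

Both correct responses receive the same reward, yet the gap between their logits changes by $\eta\,(p_1 - p_2)(1 - J)$ per step, which is positive whenever $p_1 > p_2$. The probability ratio therefore widens in favor of the response that started out more likely, a small-scale version of the winner-take-all effect.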

Theorems & Definitions (6)

  • Proposition 3.1
  • Definition 4.1: Interference in Language Model
  • Proposition C.1
  • proof
  • Lemma D.1
  • proof