The Invisible Leash: Why RLVR May or May Not Escape Its Origin

Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi

TL;DR

The paper investigates whether Reinforcement Learning with Verifiable Rewards (RLVR) truly expands a base model's reasoning or merely reinforces existing high-reward patterns. It introduces an empirical support framework and a set of metrics to quantify how RLVR changes access to correct solutions across diverse domains and model scales. The findings show RLVR predominantly preserves the base model's solution coverage, with limited expansion and consistent shrinkage, while improving single-shot precision and reducing answer-level diversity. The authors argue that breaking RLVR's invisible leash likely requires explicit exploration or hybrid strategies that seed probability mass into underrepresented solution regions, with practical implications for designing more capable reasoning systems.

Abstract

Recent advances in LLMs highlight RLVR as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether the current practice of RLVR truly expands a model's reasoning boundary or mainly amplifies high-reward outputs that the base model already knows for improved precision. This study presents an empirical investigation that provides fresh insights into the potential limits of the common practice of RLVR. We examine how, under current training conditions, RLVR can operate as a support-constrained optimization mechanism that may restrict the discovery of entirely original solutions, remaining constrained by the base model's initial distribution. We also identify an entropy-reward trade-off: while the current RLVR recipe reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while the current RLVR recipe consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy - resulting in greater uncertainty at each generation step - answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of the current RLVR recipe in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.

Paper Structure

This paper contains 55 sections, 6 theorems, 60 equations, 4 figures, 11 tables.

Key Result

Theorem C.1

Let $\pi_\theta(y \mid x)$ be the RLVR-trained distribution obtained via standard on-policy gradient updates on verifiable rewards $R$, starting from the base model $q$. Then for all $x \in \mathcal{X}$, $\mathrm{supp}(\pi_\theta(\cdot \mid x)) \subseteq \mathrm{supp}(q(\cdot \mid x))$. In particular, if $q(y^* \mid x) = 0$ for some correct solution $y^*$, then RLVR cannot discover $y^*$.
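
The claim that a completion with zero base probability cannot be discovered can be illustrated with a toy example. The sketch below is not the paper's training procedure or proof: it assumes an idealized multiplicative reweighting, $p(y) \leftarrow p(y)\exp(\eta R(y))/Z$, as a stand-in for on-policy updates, and the answer labels, probabilities, and the helper name `tilt` are all hypothetical.

```python
import math

def tilt(policy, reward, eta=1.0):
    """One idealized update: p(y) <- p(y) * exp(eta * R(y)) / Z.

    This is an assumed stand-in for an on-policy update, chosen because
    it makes the support argument easy to see: zero mass stays zero.
    """
    weights = {y: p * math.exp(eta * reward.get(y, 0.0)) for y, p in policy.items()}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

# Hypothetical base distribution q over four candidate answers to one prompt.
# Answer "D" is correct but has exactly zero probability under q.
q = {"A": 0.6, "B": 0.3, "C": 0.1, "D": 0.0}
reward = {"C": 1.0, "D": 1.0}  # verifiable reward: "C" and "D" are correct

pi = dict(q)
for _ in range(20):
    pi = tilt(pi, reward, eta=0.5)

support_q = {y for y, p in q.items() if p > 0}
support_pi = {y for y, p in pi.items() if p > 0}

print("support of q :", sorted(support_q))
print("support of pi:", sorted(support_pi))
print("pi:", {y: round(p, 6) for y, p in pi.items()})
```

In this toy setting the mass concentrates on "C", the rewarded answer the base model could already produce, while "D" stays at zero no matter how many updates are applied, because a multiplicative update cannot create mass where there was none.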

Figures (4)

  • Figure 1: Empirical support dynamics of RLVR. Left: Conceptual illustration of empirical support under a threshold $\epsilon$. We define four regions based on whether a correct completion $y^* \in \mathcal{C}$ is assigned non-negligible probability mass by the base model $q$ and the RLVR model $\pi_\theta$: Empirical Support Preservation covers completions with $q(y^*|x) > \epsilon$ and $\pi_\theta(y^*|x) > \epsilon$; Empirical Support Shrinkage includes correct completions downweighted by RLVR below $\epsilon$; Empirical Support Expansion includes completions that RLVR newly upweights above $\epsilon$ despite negligible base model mass; and Out of Support refers to completions missed by both. Right: Pie charts showing the proportion of completions in each category across diverse reasoning tasks.
  • Figure 2: Typical empirical support preservation in Reasoning Gym tasks, like Graph Coloring, Palindrome Generation, and Advanced Geometry.
  • Figure 3: Instances of empirical support expansion, as seen in Boxnet, Dice, and Arc 1D tasks.
  • Figure 4: Examples of empirical support shrinkage on Reasoning Gym tasks such as Leg Counting, Family Relationships, and Power Function.
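
The four regions described in the Figure 1 caption can be written as a small classification rule. This is a minimal sketch under stated assumptions: the function name `classify_support`, the category labels, and the example probabilities are hypothetical, and in practice $q(y^*|x)$ and $\pi_\theta(y^*|x)$ would have to be estimated (for example from sampling), which is not shown here.

```python
from collections import Counter

def classify_support(q_prob, pi_prob, eps):
    """Assign a correct completion to one of the four regions of Figure 1.

    q_prob  : probability of the completion under the base model q
    pi_prob : probability under the RLVR model pi_theta
    eps     : threshold for "non-negligible" probability mass
    """
    in_base = q_prob > eps
    in_rlvr = pi_prob > eps
    if in_base and in_rlvr:
        return "preservation"    # above eps under both models
    if in_base:
        return "shrinkage"       # downweighted by RLVR below eps
    if in_rlvr:
        return "expansion"       # newly upweighted above eps
    return "out_of_support"      # missed by both models

# Hypothetical (q, pi_theta) probabilities for five correct completions.
examples = [(0.20, 0.65), (0.05, 0.001), (0.0005, 0.03), (0.0001, 0.0002), (0.10, 0.40)]
eps = 0.01

counts = Counter(classify_support(q_p, pi_p, eps) for q_p, pi_p in examples)
print(dict(counts))
```

Tallying the labels over many prompts gives proportions of the kind shown in the pie charts on the right of Figure 1.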

Theorems & Definitions (14)

  • Definition 2.2: Empirical Support Dynamics
  • Definition 2.3: Support Dynamics Metrics
  • Theorem C.1: Support Preservation under RLVR
  • proof
  • Corollary C.2: Asymptotic Sampling Upper Bound
  • proof
  • Theorem C.3: Empirical Support Preservation
  • proof
  • Proposition C.4: KL Projection onto Reward-Consistent Distributions
  • proof
  • ...and 4 more