New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR

Zhilin Wang, Yafu Li, Shunkai Zhang, Zhi Wang, Haoran Zhang, Xiaoye Qu, Yu Cheng

TL;DR

The paper argues that RLVR can endow LLMs with genuinely new capabilities rather than merely surfacing latent traces, by framing capability as instance-level solvability and showing that sharpening atomic sub-tasks enables complex reasoning to cross a multiplicative barrier. It introduces a probabilistic model of Pass@k, defines epsilon-incapability and delta-capability to partition tasks into Null, Transitional, and Feasible regimes, and demonstrates that RLVR pushes tasks from Null toward Feasible through targeted atomic skill amplification. The work validates the theory on synthetic algebraic tasks generated by the Algebrarium framework, showing emergent capabilities with substantial cross-model evidence and highlighting a concomitant erosion of minority skills due to global optimization. These findings offer a mechanistic account of emergent abilities in RLVR and implications for designing training regimes that balance global rewards with diverse capability retention.

Abstract

Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance-level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi-step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model's existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients ($\rho \in [0.69, 0.96]$); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.
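The multiplicative barrier described in the abstract can be illustrated with a small numerical sketch. It assumes that the steps of a reasoning chain succeed independently with the same probability and that samples are drawn i.i.d., so that a chain of $N$ steps succeeds with probability $p^N$ and $Pass@k = 1 - (1 - p)^k$ for a single instance. The paper's exact formulation may differ, and the function names below are hypothetical.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent,
    equally reliable atomic steps (an illustrative assumption)."""
    return p_step ** n_steps


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k i.i.d. samples is correct
    when each sample is correct with probability p."""
    return 1.0 - (1.0 - p) ** k


def dataset_pass_at_k(probs: list[float], k: int) -> float:
    """Dataset-level Pass@k as the mean of the per-instance values."""
    return sum(pass_at_k(p, k) for p in probs) / len(probs)


if __name__ == "__main__":
    n_steps, k = 10, 128
    # Illustrative atomic accuracies before and after sharpening (made-up values).
    for p_step in (0.30, 0.60, 0.95):
        p_chain = chain_success(p_step, n_steps)
        print(f"p_step={p_step:.2f}  chain={p_chain:.6f}  "
              f"pass@{k}={pass_at_k(p_chain, k):.4f}")
```

Under these assumptions, a moderate gain in per-step accuracy compounds across the chain, which is one way to read the claim that sharpening atomic skills moves a multi-step task from practically unsolvable to solvable within a fixed sampling budget.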

Paper Structure (52 sections, 13 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Theoretical visualization of $Pass@k$. (a) Functional curves for a single problem under varying correctness probabilities. (b) and (c) Composite average curves for a simplified dataset ($|\mathcal{D}|=2$), showing the aggregation from instances to the dataset-level metric.
  • Figure 2: Comparison of theoretical and empirical Pass@k curves. The figure illustrates that the theoretical and empirical curves align remarkably well, consistently yielding a Mean Squared Error (MSE) magnitude less than $10^{-4}$. Furthermore, the RL models demonstrate substantial performance improvements over the Base models across the sampling spectrum.
  • Figure 3: Analysis of Capability Emergence. We analyze the subset of tasks initially classified as Null State ($\text{Avg@128} = 0$) in the Base model. Left: The recovery rate, indicating the proportion of these impossible tasks that successfully transitioned to the Feasible State ($\text{Avg@128} \ge 0.125$) after RL training. Right: Post-emergence accuracy distribution. The significant skew toward high performance (mean $\approx 0.60$) indicates that emergence occurs as a sharp phase transition.
  • Figure 4: Verification of Exponential Decay. The figure validates the Multiplicative Barrier hypothesis ($P \propto p^N$).
  • Figure 5: Process vs. Outcome Correlation Analysis. Scatter plots correlating joint step accuracy ($\prod P_\theta(s_j)$) with Outcome Accuracy ($\prod P_\theta(q)$). Selected base models from the Llama and Gemma families cluster in the "Null Regime" near the origin, indicating that a breakdown in the joint reasoning chain leads to task failure. In contrast, Qwen models and RL-tuned models exhibit a high correlation.
  • ...and 2 more figures
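
The recovery-rate analysis described in the Figure 3 caption can be sketched as follows. The thresholds for the Null State ($\text{Avg@128} = 0$) and the Feasible State ($\text{Avg@128} \ge 0.125$) are taken from that caption. Labeling everything in between as Transitional is an assumption made here for illustration, as are the function names and the example scores.

```python
def classify(avg_at_n: float, feasible_threshold: float = 0.125) -> str:
    """Assign a regime from an empirical Avg@128 score.
    Null and Feasible follow the Figure 3 caption; treating the
    remaining range as Transitional is an assumption."""
    if avg_at_n == 0:
        return "Null"
    if avg_at_n >= feasible_threshold:
        return "Feasible"
    return "Transitional"


def recovery_rate(base_scores: list[float], rl_scores: list[float]) -> float:
    """Fraction of tasks that are Null under the base model and
    Feasible after RL training. Returns 0.0 if no task starts as Null."""
    null_idx = [i for i, s in enumerate(base_scores) if classify(s) == "Null"]
    if not null_idx:
        return 0.0
    recovered = sum(1 for i in null_idx if classify(rl_scores[i]) == "Feasible")
    return recovered / len(null_idx)


if __name__ == "__main__":
    # Made-up per-task Avg@128 scores, for illustration only.
    base = [0.0, 0.0, 0.5, 0.0]
    rl = [0.6, 0.0, 0.9, 0.2]
    print(recovery_rate(base, rl))
```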