New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR
Zhilin Wang, Yafu Li, Shunkai Zhang, Zhi Wang, Haoran Zhang, Xiaoye Qu, Yu Cheng
TL;DR
The paper argues that RLVR can endow LLMs with genuinely new capabilities rather than merely surfacing latent traces, by framing capability as instance-level solvability and showing that sharpening atomic sub-tasks enables complex reasoning to cross a multiplicative barrier. It introduces a probabilistic model of Pass@k, defines epsilon-incapability and delta-capability to partition tasks into Null, Transitional, and Feasible regimes, and demonstrates that RLVR pushes tasks from Null toward Feasible through targeted atomic skill amplification. The work validates the theory on synthetic algebraic tasks generated by the Algebrarium framework, showing emergent capabilities with substantial cross-model evidence and highlighting a concomitant erosion of minority skills due to global optimization. These findings offer a mechanistic account of emergent abilities in RLVR and implications for designing training regimes that balance global rewards with diverse capability retention.
Abstract
Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance-level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi-step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model's existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients ($\rho \in [0.69, 0.96]$); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.
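The exponential decay mentioned in the abstract can be illustrated with a small numerical sketch. It is not the paper's model: it assumes that atomic steps succeed independently, so that the chain success probability is the product of the per-step probabilities, and that the k samples are independent, so that Pass@k is $1 - (1 - p)^k$. The function names `chain_success` and `pass_at_k`, the chain length, and the probabilities are hypothetical choices made for illustration.

```python
import math


def chain_success(step_probs):
    """Probability that every step succeeds, ASSUMING independent steps."""
    return math.prod(step_probs)


def pass_at_k(p, k):
    """Probability that at least one of k samples succeeds,
    ASSUMING each sample succeeds independently with probability p."""
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    n_steps, k = 8, 64  # illustrative values, not taken from the paper
    for p_step in (0.5, 0.9, 0.99):
        p_chain = chain_success([p_step] * n_steps)
        print(
            f"p_step={p_step:.2f}  "
            f"p_chain={p_chain:.6f}  "
            f"pass@{k}={pass_at_k(p_chain, k):.6f}"
        )
```

Under these assumptions, a modest increase in per-step probability compounds across the chain, which is one way to read the claim that sharpening atomic steps can move a multi-step task from practically unsolvable to solvable within a fixed sampling budget.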
