How Far Can Unsupervised RLVR Scale LLM Training?

Bingxiang He; Yuxin Zuo; Zeyuan Liu; Shangziqi Zhao; Zixuan Fu; Junlin Yang; Cheng Qian; Kaiyan Zhang; Yuchen Fan; Ganqu Cui; Xiusi Chen; Youbang Sun; Xingtai Lv; Xuekai Zhu; Li Sheng; Ran Li; Huan-ang Gao; Yuchen Zhang; Bowen Zhou; Zhiyuan Liu; Ning Ding


TL;DR

This work revisits URLVR and provides a comprehensive analysis spanning taxonomy, theory and extensive experiments, showing intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by model prior rather than engineering choices.

Abstract

Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground-truth labels. Recent works leverage model-intrinsic signals, showing promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory, and extensive experiments. We first classify URLVR methods into intrinsic versus external based on reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when the two are misaligned. Through systematic experiments, we show that intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by the model prior rather than engineering choices. Despite these scaling limits, we find that intrinsic rewards remain valuable in test-time training on small datasets, and we propose Model Collapse Step to measure the model prior, serving as a practical indicator of RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence that they may escape the confidence-correctness ceiling. Our findings chart boundaries for intrinsic URLVR while motivating paths toward scalable alternatives.
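To make the confidence-correctness point concrete, here is a minimal sketch of a majority-voting intrinsic reward, one of the intrinsic methods named in the figure captions of this summary. It is an illustration under stated assumptions, not the authors' implementation: the function name `majority_vote_rewards`, the toy answer strings, and the `truth` label (used only to inspect the outcome, never to compute rewards) are all hypothetical.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Label-free reward: 1.0 for rollouts that agree with the majority answer.

    `answers` is a list of final answers extracted from sampled rollouts.
    No ground-truth label is used anywhere in this function.
    """
    counts = Counter(answers)
    majority, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    return majority, rewards


# Hypothetical rollouts for one problem whose true answer is "42".
truth = "42"  # used only to inspect the outcome, not to compute rewards
cases = {
    "aligned": ["42", "42", "42", "17", "42"],    # model is confident and right
    "misaligned": ["17", "17", "42", "17", "9"],  # model is confident and wrong
}

for name, answers in cases.items():
    majority, rewards = majority_vote_rewards(answers)
    print(name, "majority:", majority, "rewards:", rewards,
          "majority correct:", majority == truth)
```

In the aligned case the reward coincides with what a ground-truth verifier would give. In the misaligned case the same rule rewards the wrong answer "17" and penalizes the one correct rollout, which is the failure mode the abstract describes.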

Paper Structure

This paper contains 56 sections, 2 theorems, 43 equations, 41 figures, and 9 tables.

Introduction
Taxonomy of Unsupervised RLVR
Intrinsic Reward Methods
External Reward Methods
The Sharpening Mechanism of Intrinsic Rewards
Dynamics of One-Step Update
Convergence Towards Sharpening Initial Distribution
When Does Intrinsic URLVR Work?
The Rise and Fall of Intrinsic URLVR
Early Success, Later Collapse
Different Methods, Different Failures
Fine-Grained Per-Problem Analysis
In-Distribution Per-Problem Sharpening
Out-Of-Distribution Cross-Problem Generalization
How Can Sharpening from Intrinsic URLVR Be Applied Safely?
...and 41 more sections

Key Result

Theorem 1

Geometric Convergence to Deterministic Policy. Consider the training process where, at each iteration $k$, we (1) sample $N$ rollouts $Y_k$ from $\pi_\theta^{(k)}$, (2) compute the majority answer $\text{maj}_k(Y_k)$, and (3) perform a one-step update with reward $r_k(x,y) = \mathbf{1}[\text{ans}(y) = \text{maj}_k(Y_k)]$ ... Then $p_{\mathrm{maj}}^{(k)}$ converges geometrically to $1$ with rate $\rho = e^{-1/\beta}$, and ...
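The convergence rate in Theorem 1 can be checked numerically with a small sketch. The one-step update is not spelled out in this excerpt, so the code below assumes the standard closed-form KL-regularized update, $\pi^{(k+1)}(y) \propto \pi^{(k)}(y)\, e^{r_k(x,y)/\beta}$, and further assumes the majority answer is identified exactly rather than estimated from $N$ sampled rollouts. The names `sharpen_step` and `simulate`, and the values of `p0` and `beta`, are illustrative choices.

```python
import math


def sharpen_step(p_maj, beta):
    """One assumed KL-regularized update on the majority-answer probability.

    With reward 1 on the majority answer and 0 elsewhere, the update
    pi'(y) ∝ pi(y) * exp(r(y) / beta) scales the majority mass by
    exp(1 / beta) before renormalizing.
    """
    boosted = p_maj * math.exp(1.0 / beta)
    return boosted / (boosted + (1.0 - p_maj))


def simulate(p0, beta, steps):
    """Iterate the update and return the trajectory of p_maj."""
    trajectory = [p0]
    for _ in range(steps):
        trajectory.append(sharpen_step(trajectory[-1], beta))
    return trajectory


beta = 2.0
traj = simulate(p0=0.6, beta=beta, steps=30)

# The gap 1 - p_maj should shrink by a factor approaching exp(-1 / beta).
gaps = [1.0 - p for p in traj]
last_ratio = gaps[-1] / gaps[-2]
print("final p_maj:", traj[-1])
print("last gap ratio:", last_ratio)
print("exp(-1/beta):", math.exp(-1.0 / beta))
```

Under these assumptions the gap ratio is $1 / (p\,e^{1/\beta} + 1 - p)$, which tends to $e^{-1/\beta}$ as $p \to 1$, consistent with the rate $\rho$ stated in the theorem. The update drives mass to the majority answer whether or not that answer is correct.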

Figures (41)

Figure 1: Overview of our paper's framework. At the center is the taxonomy of Unsupervised RLVR methods, categorized into intrinsic rewards and external rewards. The four surrounding panels illustrate the key findings of our empirical investigation.
Figure 2: Training dynamics comparing majority-voting training and ground-truth training. Intrinsic rewards initially match supervised performance but eventually collapse: the proxy reward rises while validation accuracy falls, revealing divergence between optimizing confidence and optimizing correctness.
Figure 3: Five intrinsic reward methods exhibit distinct failure patterns. Self-Certainty and Majority Voting degrade gradually while maintaining label accuracy. Probability collapses toward brevity, and entropy-based methods drive entropy down through repetition rather than correctness.
Figure 4: Training dynamics on individual representative problems. For each problem, the heatmap shows greedy decoding correctness across epochs (blue = correct, red = wrong; darker = higher confidence), and the green wave indicates whether the highest-reward rollout is correct.
Figure 5: Training Label Accuracy (blue) on six MATH500 problems and Test Label Accuracy on two OOD problems: ID 76 (orange) and ID 131 (green).
...and 36 more figures

Theorems & Definitions (2)

Theorem 1
Proposition 1: Sharpening Dynamics for $\sigma = -1$ Methods
