BroRL: Scaling Reinforcement Learning via Broadened Exploration

Jian Hu, Mingjie Liu, Ximing Lu, Fang Wu, Zaid Harchaoui, Shizhe Diao, Yejin Choi, Pavlo Molchanov, Jun Yang, Jan Kautz, Yi Dong

TL;DR

BroRL introduces rollout-size scaling as a principled axis for scaling RL-based reasoning in large language models. Through a mass-balance analysis in the logit domain, it shows that increasing the number of rollouts per prompt $N$ dampens a negative unsampled-coupling term, making positive updates more reliable as $N$ grows. Empirically, BroRL revives models that plateau under ProRL and achieves state-of-the-art results on a 1.5B model across math, code, and reasoning benchmarks, while also nearly doubling hardware throughput by shifting generation from memory-bound to compute-bound. This work provides both theoretical guarantees and practical guidance for more data- and compute-efficient RL-based reasoning.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. Recent work, ProRL, has shown promise in scaling RL by increasing the number of training steps. However, performance plateaus after thousands of steps, with clear diminishing returns from allocating more computation to additional training. In this work, we investigate a complementary paradigm for scaling RL, BroRL: increasing the number of rollouts per example to hundreds to exhaustively Broaden exploration, which yields continuous performance gains beyond the saturation point observed in ProRL when scaling the number of training steps. Our approach is motivated by a mass balance equation analysis that allows us to characterize the rate of change in probability mass for correct and incorrect tokens during the reinforcement process. We show that under a one-step RL assumption, sampled rollout tokens always contribute to correct-mass expansion, while unsampled tokens outside rollouts may lead to gains or losses depending on their distribution and the net reward balance. Importantly, as the number of rollouts per example $N$ increases, the effect of unsampled terms diminishes, ensuring overall correct-mass expansion. To validate our theoretical analysis, we conduct simulations under more relaxed conditions and find that a sufficiently large rollout size $N$, corresponding to ample exploration, guarantees an increase in the probability mass of all correct tokens. Empirically, BroRL revives models saturated after 3K ProRL training steps and demonstrates robust, continuous improvement, achieving state-of-the-art results for the 1.5B model across diverse benchmarks.

Paper Structure

This paper contains 37 sections, 3 theorems, 24 equations, 4 figures, 5 tables.

Key Result

Theorem 1

where $A_2, B_2 \geq 0$ and $S_R \in [R_w, R_c]$, which implies $R_c - S_R \ge 0$ and $S_R - R_w \ge 0$. Therefore, the first two terms are nonnegative, while the last term represents the coupling of unsampled masses.
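
As a sanity check on this sign argument, consider the idealized limit in which the exact policy gradient is used, so that no probability mass is left unsampled. For a single softmax distribution $p$ over tokens with rewards $R_c$ (correct) and $R_w$ (incorrect), a first-order expansion of one gradient-ascent step with learning rate $\eta$ gives $\Delta Q_{\mathrm{pos}} \approx \eta\,[(R_c - S)\,Q_{\mathrm{neg}}\,P_2 + (S - R_w)\,Q_{\mathrm{pos}}\,N_2]$, where $S$ is the expected reward and $P_2$, $N_2$ are the sums of squared probabilities over correct and incorrect tokens. This is an illustrative derivation under these stated assumptions, not the paper's Theorem 1; the symbols $S$, $P_2$, $N_2$ and the function name `exact_one_step` are introduced here for the sketch only. The Python snippet below compares the closed form against the change measured after an actual step.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def exact_one_step(logits, correct, r_c=1.0, r_w=-1.0, lr=1e-3):
    """One exact policy-gradient step on a toy softmax policy.

    Assumptions (not from the paper): a single decision over a finite token
    set, reward r_c for correct tokens and r_w for incorrect ones, and the
    exact gradient of the expected reward with respect to the logits.
    Returns (measured change, first-order prediction) of the correct mass.
    """
    correct = np.asarray(correct, dtype=bool)
    p = softmax(logits)
    reward = np.where(correct, r_c, r_w)
    s = p @ reward  # expected reward under the current policy

    # Exact gradient of sum_i p_i * r_i with respect to logit k is p_k * (r_k - s).
    new_p = softmax(logits + lr * p * (reward - s))
    measured = new_p[correct].sum() - p[correct].sum()

    q_pos = p[correct].sum()
    q_neg = 1.0 - q_pos
    pos2 = (p[correct] ** 2).sum()   # concentration of correct mass
    neg2 = (p[~correct] ** 2).sum()  # concentration of incorrect mass
    predicted = lr * ((r_c - s) * q_neg * pos2 + (s - r_w) * q_pos * neg2)
    return measured, predicted


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=50)
    correct = np.arange(50) < 10  # first 10 tokens are "correct" in this toy
    print(exact_one_step(logits, correct))
```

Both terms of the prediction are products of nonnegative factors, so in this fully sampled limit the correct mass cannot shrink to first order. With finite rollouts, the additional unsampled-coupling term described in the theorem is what can change the sign.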

Figures (4)

  • Figure 1: Empirical results demonstrate that BroRL ($N=512$) continues to improve math performance, whereas ProRL ($N=16$) reaches a plateau at the 3K-step checkpoint and degrades further with prolonged training.
  • Figure 2: This illustration shows how a single RLVR update step changes the total probability mass of correct tokens, $Q_{\mathrm{pos}}$, by an amount $\Delta Q_{\mathrm{pos}}$. The dashed guide lines labeled $Q_{\mathrm{pos}}$ (green) and $Q_{\mathrm{neg}}$ (red) connect the pooled probability assigned to the correct and incorrect token sets across sampled and unsampled regions. The change is composed of two parts. The sampled part (left) always produces a nonnegative gain by promoting "sampled-correct" tokens (concentration measured by $A_2$) and demoting "sampled-incorrect" tokens (concentration measured by $B_2$), thereby shifting probability from the $Q_{\mathrm{neg}}$ pool to the $Q_{\mathrm{pos}}$ pool. The unsampled part (right) is conditional: it can add or remove mass depending on the batch mood $S_R$ and on whether unsampled incorrect probability is more concentrated than unsampled correct probability. As the number of samples per prompt $N$ grows, the unsampled concentration terms $U_{\mathrm{pos},2}$ and $U_{\mathrm{neg},2}$ shrink, so the net effect tends toward $\Delta Q_{\mathrm{pos}}\!\ge\!0$; the amount of mass moved scales with the pool sizes $Q_{\mathrm{pos}}$ and $Q_{\mathrm{neg}}$.
  • Figure 3: Training dynamics of the simulator under varying rollout size $N$. We track (i) the total probability mass assigned to correct actions, (ii) the fraction of correct actions whose probability increased relative to step 0, and (iii) the worst negative change in probability among correct actions. Larger $N$ produces more stable updates, faster accumulation of probability mass, and crucially it eliminates knowledge shrinkage by removing negative probability drops altogether.
  • Figure 4: Pass@1 comparison of BroRL vs. ProRL, normalized by training compute. Rows show representative trajectories: (1) both improve but BroRL consistently outperforms ProRL; (2) ProRL degrades while BroRL continues to improve; (3) both methods fail to yield consistent gains.
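
The three quantities tracked in Figure 3 can be illustrated with a small toy simulator. The sketch below is not the paper's simulator: it assumes a single softmax policy over a finite action set, rewards of $+1$ for correct and $-1$ for incorrect actions, and a plain REINFORCE update on the logits without a baseline. The function name `simulate` and all default parameters are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def simulate(n_rollouts, n_actions=200, n_correct=20, steps=30, lr=0.5, seed=0):
    """Toy single-state RLVR loop (illustrative assumptions, see lead-in)."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=n_actions)
    reward = np.full(n_actions, -1.0)  # assumed reward for incorrect actions
    reward[:n_correct] = 1.0           # assumed reward for correct actions
    correct = reward > 0
    p0 = softmax(logits)

    for _ in range(steps):
        p = softmax(logits)
        # Draw N rollouts; counts[i] is how often action i was sampled.
        counts = rng.multinomial(n_rollouts, p)
        mean_reward = counts @ reward / n_rollouts
        # REINFORCE gradient on the logits, averaged over rollouts:
        # (1/N) * sum_n r_n * (onehot(a_n) - p).
        # Unsampled actions still move through the -mean_reward * p term.
        grad = counts * reward / n_rollouts - mean_reward * p
        logits = logits + lr * grad

    p = softmax(logits)
    delta = p[correct] - p0[correct]
    return {
        "correct_mass_start": float(p0[correct].sum()),
        "correct_mass_end": float(p[correct].sum()),             # metric (i)
        "frac_correct_increased": float((delta > 0).mean()),     # metric (ii)
        "worst_correct_drop": float(min(delta.min(), 0.0)),      # metric (iii)
    }


if __name__ == "__main__":
    for n in (4, 16, 512):
        print(n, simulate(n))
```

Running the loop for several values of `n_rollouts` gives a rough feel for how rollout size affects the three metrics in this toy setting. The numbers it prints are specific to these assumptions and should not be read as a reproduction of the paper's results.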

Theorems & Definitions (3)

  • Theorem 1: Sign of Correct-Mass Change
  • Lemma 2
  • Corollary 3