Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

Haobo Xu, Sirui Chen, Ruizhong Qiu, Yuchen Yan, Chen Luo, Monica Cheng, Jingrui He, Hanghang Tong

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO incur substantial computational cost because they sample many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: for many prompts the rollouts are nearly all correct or all incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce ARRoL (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones toward a more correctness-balanced composition to strengthen learning signals. Specifically, ARRoL trains a lightweight quality head on the fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weight candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), ARRoL improves average accuracy by +2.30 to +2.99 while achieving up to a 1.7x training speedup, and yields an additional gain of up to +8.33 in average accuracy under test-time scaling. The code is available at https://github.com/Hsu1023/ARRoL.
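
The abstract describes the core mechanism: a lightweight quality head scores partial rollouts and low-value rollouts are pruned mid-generation so that the surviving group is more correctness-balanced. The following is a minimal sketch of that scoring-and-pruning idea; the names `QualityHead` and `prune_rollouts`, the pooled-hidden-state input, and the greedy balancing rule are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class QualityHead(nn.Module):
    """Hypothetical lightweight head: maps a pooled hidden state of a
    partial rollout to a predicted probability of final correctness."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (num_rollouts, hidden_size) pooled states of partial rollouts
        return torch.sigmoid(self.proj(hidden)).squeeze(-1)

def prune_rollouts(scores: torch.Tensor, keep: int, target_ratio: float = 0.5):
    """Greedily prune rollouts so the mean predicted success probability of
    the survivors moves toward `target_ratio` (a correctness-balanced group).
    Illustrative heuristic only, not the paper's exact pruning rule."""
    kept = list(range(scores.numel()))
    while len(kept) > keep:
        mean = scores[kept].mean()
        # Drop the rollout whose removal moves the group mean toward the
        # target (cf. Lemma 4.1: such a corrective candidate exists).
        if mean > target_ratio:
            drop = max(kept, key=lambda i: scores[i].item())
        else:
            drop = min(kept, key=lambda i: scores[i].item())
        kept.remove(drop)
    return kept

# Toy usage: 8 partial rollouts, keep 4 with a balanced predicted success rate.
head = QualityHead(hidden_size=16)
hidden_states = torch.randn(8, 16)
scores = head(hidden_states)
print(prune_rollouts(scores, keep=4))
```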

Paper Structure

This paper contains 44 sections, 5 theorems, 29 equations, 4 figures, 5 tables.

Key Result

Lemma 4.1

Consider a mini-batch of size $G$, each with a label $y_i\in\{0,1\}$. We assume a fixed positive ratio $\rho\in [0,1]$ and conditional independence given latent Bernoulli parameters: $Y_i|q^\star_i\sim \text{Bernoulli}(q^\star_i)$. We define the batch mean $\mu^\star := \frac{1}{G}\sum_{i=1}^G q_i^\star$. If $\mu^\star>\rho$ and there exists $j$ such that $q_j^\star>\mu^\star$, then pruning rollout $j$ strictly decreases the batch mean, moving it toward $\rho$. Symmetrically, if $\mu^\star<\rho$ and there exists $j$ such that $q_j^\star<\mu^\star$, then the same conclusion holds with the direction reversed.
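
As a sanity check (a worked step added here, consistent with the lemma's notation but not quoted from the paper), removing rollout $j$ changes the mean of the remaining $G-1$ values to
$$\mu^\star_{-j} \;=\; \frac{G\mu^\star - q_j^\star}{G-1} \;=\; \mu^\star - \frac{q_j^\star - \mu^\star}{G-1},$$
so when $\mu^\star>\rho$ and $q_j^\star>\mu^\star$ the pruned mean satisfies $\mu^\star_{-j}<\mu^\star$, i.e. the batch mean shifts in the direction of $\rho$; the case $\mu^\star<\rho$, $q_j^\star<\mu^\star$ is symmetric.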

Figures (4)

  • Figure 1: ARRoL overview and results. (a) ARRoL uses a quality head to score partial rollouts, enabling early pruning for efficient and reward-balanced training; the scores can also serve as voting weights for test-time scaling (see the weighted-voting sketch after this figure list). (b) Wall-clock time comparison between ARRoL and GRPO across different model backbones, showing consistent speedups. (c) Accuracy comparison, where ARRoL improves average accuracy over GRPO.
  • Figure 2: (a) Trace Confidence Failure Modes: Reflection-related tokens tend to receive low confidence despite being beneficial, whereas formula-heavy tokens can receive high confidence even under incorrect reasoning. (b) Distribution Comparison. Trace confidence (b.1) separates correct from incorrect rollouts less clearly than quality-head scores (b.2). (c) Correlation Comparison. Quality-head scores achieve consistently higher correlation, measured by the Spearman rank correlation between predicted scores (quality scores or trace confidence) and the binary correctness of final answers on the Math500 and Dapo17k datasets. (d) Generation Length vs. Correlation & Time Cost. Time cost increases with generation length, while the correlation plateaus once the length reaches 512. All data are generated by the Qwen3-4B model on 400 prompts from the Dapo17k and Math500 datasets, with 10 rollouts per prompt.
  • Figure 3: Illustration of System Design.
  • Figure 4: Wall-clock convergence of Qwen-3-1.7B-Base training.
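
As noted in the Figure 1 caption, the quality-head scores can also serve as voting weights at test time. Below is a minimal sketch of such quality-weighted voting; the function name `weighted_vote` and the toy numbers are hypothetical and not taken from the ARRoL code.

```python
from collections import defaultdict

def weighted_vote(answers, quality_scores):
    """Hypothetical quality-weighted voting for test-time scaling:
    each candidate answer contributes its predicted success probability
    as a vote weight; the answer with the largest total weight wins."""
    totals = defaultdict(float)
    for ans, score in zip(answers, quality_scores):
        totals[ans] += score
    return max(totals, key=totals.get)

# Toy usage: plain majority voting would pick "42" (3 votes vs. 2),
# but the quality scores tip the decision to "41".
answers = ["42", "41", "42", "41", "42"]
scores = [0.20, 0.90, 0.15, 0.85, 0.10]
print(weighted_vote(answers, scores))  # -> "41"
```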

Theorems & Definitions (10)

  • Lemma 4.1: Existence of a Corrective Pruning
  • Theorem 4.2: High-probability closeness to target $\rho$
  • proof
  • Lemma A.1: Posterior error transfers to batch ratio
  • proof
  • Lemma A.2: Near-optimality of posterior-guided pruning
  • proof
  • Lemma A.3: Concentration of realized ratio around its expectation
  • proof
  • proof