Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts

Haizhong Zheng, Yang Zhou, Brian R. Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, Beidi Chen

TL;DR

This work tackles the computational bottleneck of reinforcement-learning-based fine-tuning for large language models by introducing GRESO, an online pre-rollout filtering method that skips uninformative prompts using reward training dynamics. Grounded in observations of strong temporal consistency in prompt value and the dynamics of zero-variance prompts under GRPO, GRESO predicts and filters prompts before rollout with a probabilistic mechanism that balances exploration and efficiency. Empirical results across multiple math benchmarks and model sizes demonstrate up to $2.4\times$ rollout speedups and up to $2.0\times$ overall training time reductions, without sacrificing accuracy. The approach offers a practical path to scalable RL for LLM reasoning, reducing wasted computation and enabling more efficient rollout scaling in real-world settings.

Abstract

Reinforcement learning algorithms such as PPO and GRPO have powered recent breakthroughs in LLM reasoning. Scaling up rollouts to sample more prompts lets training selectively use higher-quality data, which can stabilize RL training and improve model performance. However, this comes at the cost of significant computational overhead. In this paper, we show that a substantial portion of this overhead can be avoided by skipping uninformative prompts before rollout. Our analysis of reward dynamics reveals strong temporal consistency in prompt value: prompts that are uninformative in one training epoch are likely to remain uninformative in later epochs. Based on these insights, we propose GRESO (GRPO with Efficient Selective Rollout), an online, lightweight pre-rollout filtering algorithm that uses reward training dynamics to predict and skip uninformative prompts. Evaluating GRESO on a broad range of math reasoning benchmarks and models, including Qwen2.5-Math-1.5B, DeepSeek-R1-Distill-Qwen-1.5B, and Qwen2.5-Math-7B, we show that it achieves up to $2.4\times$ wall-clock speedup in rollout and up to $2.0\times$ speedup in total training time without accuracy degradation.
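The pre-rollout filtering idea can be illustrated with a minimal sketch. Everything named here is hypothetical: the function names, the `p_explore` parameter, and in particular the rule that a prompt is skipped with probability `1 - p_explore ** z`, where `z` counts its most recent consecutive zero-variance epochs. The abstract only states that filtering is probabilistic and driven by reward training dynamics, so this rule is an assumption chosen for illustration, not the paper's stated formula.

```python
import random

def consecutive_zero_variance(history):
    """Count the most recent consecutive epochs in which every rollout
    of a prompt received the same reward (a zero-variance epoch)."""
    count = 0
    for rewards in reversed(history):
        if len(set(rewards)) == 1:
            count += 1
        else:
            break
    return count

def skip_probability(history, p_explore=0.5):
    """Assumed rule: the longer a prompt has been zero-variance,
    the more likely it is skipped, but never with probability 1."""
    z = consecutive_zero_variance(history)
    return 1.0 - p_explore ** z

def select_prompts(histories, p_explore=0.5, rng=None):
    """Return the prompt ids that survive filtering and go to rollout."""
    rng = rng or random.Random(0)
    kept = []
    for prompt_id, history in histories.items():
        if rng.random() >= skip_probability(history, p_explore):
            kept.append(prompt_id)
    return kept

# Per-prompt reward history: one list of rollout rewards per past epoch.
histories = {
    "p0": [[1, 0, 1, 1]],             # informative last epoch
    "p1": [[0, 0, 0, 0], [0, 0, 0, 0]],  # zero-variance twice in a row
}
selected = select_prompts(histories)
```

Under this assumed rule, a prompt with no zero-variance history is always rolled out, while a prompt that has been zero-variance for several epochs is still sampled occasionally, which matches the exploration behaviour described in the TL;DR.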

Paper Structure

This paper contains 23 sections, 10 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: We train Qwen2.5-Math-1.5B/7B on the DAPO + MATH dataset and evaluate them on five math reasoning benchmarks: MATH500, AMC, Gaokao, Minerva, and Olympiad Bench. Compared to the baseline method (Dynamic Sampling), our approach (GRESO) reduces rollout overhead by up to $2\times$ while achieving comparable training performance, improving the efficiency of rollout scaling.
  • Figure 2: Left: GRPO training with more effective data through Dynamic Sampling (DS) leads to improved final model performance. Right: However, DS requires additional rollouts to maintain the same training batch size.
  • Figure 3: Effective prompt ratio at each step of GRPO training. The ratio decreases steadily as training proceeds.
  • Figure 4: (a) Temporal correlation of examples across epochs. Prompts previously identified as zero-variance are likely to remain zero-variance. (b) Pipeline comparison between Dynamic Sampling and our GRESO method. Unlike Dynamic Sampling, which filters out zero-variance prompts after rollout, GRESO efficiently predicts and filters them based on training dynamics before rollout, which improves rollout efficiency. The probabilistic filtering also allows zero-variance prompts to still be occasionally sampled, enabling the model to revisit potentially valuable prompts.
  • Figure 5: Training dynamics analysis of Qwen2.5-Math-1.5B trained on the DAPO + MATH dataset: (a) Effective prompt ratio in each step. GRESO maintains a consistently higher effective prompt ratio during training. (b) To obtain the same number of effective prompts per batch, GRESO requires less rollout time. (c) GRESO achieves more effective rollouts for training under the same rollout time budget compared to Dynamic Sampling. (d) Ablation study on adaptive batch size (ABS) for sampling: both ABS and GRESO effectively reduce the number of rollouts per training step.
  • ...and 4 more figures
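
The captions above rely on the notion of zero-variance prompts and the effective prompt ratio. The sketch below shows why a zero-variance prompt carries no training signal under group-normalized advantages and how the ratio could be computed for a batch. The function names and the `eps` constant are hypothetical, and the normalization is the commonly used GRPO form (reward minus group mean, divided by group standard deviation), assumed here rather than quoted from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for the rollouts of one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def is_zero_variance(rewards):
    """True when every rollout of the prompt received the same reward."""
    return len(set(rewards)) <= 1

def effective_prompt_ratio(batch):
    """Fraction of prompts in a batch whose rollouts differ in reward."""
    effective = [rewards for rewards in batch if not is_zero_variance(rewards)]
    return len(effective) / len(batch)

# Each inner list holds the rewards of all rollouts for one prompt.
batch = [[1, 1, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0]]
all_correct = group_advantages(batch[0])   # every advantage is 0.0
ratio = effective_prompt_ratio(batch)      # 2 of 4 prompts are effective
```

When all rollouts for a prompt are correct or all are wrong, every advantage is zero, so the rollouts spent on that prompt contribute nothing to the update. Dynamic Sampling discards such prompts after paying for the rollout, whereas GRESO aims to skip them beforehand.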