Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning
Yixuan Even Xu, Yash Savani, Fei Fang, J. Zico Kolter
TL;DR
This paper identifies a fundamental bottleneck in RLVR for LLMs: rollout generation scales well, but policy updates are memory- and communication-bound. It proposes PODS, which generates many rollouts per prompt but trains only on a selected subset of size $m$, chosen by a max-variance down-sampling criterion, and reports significant speedups without sacrificing performance. The method is validated across multiple model sizes (3B–7B), architectures (Qwen2.5, Llama3.2), and hardware configurations on GSM8K and MATH: GRPO-PODS reaches the peak test accuracy of the GRPO baseline at least 1.7× faster, often with better final accuracy. The work offers a practical, algorithm-agnostic way to accelerate RLVR training and opens avenues for combining PODS with other RL methods and adaptive sampling strategies.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion, max-variance down-sampling, that maximizes reward diversity, and provide an efficient $O(n\log n)$ implementation. Empirically, Group Relative Policy Optimization (GRPO) with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ faster across the different reasoning benchmarks and hardware configurations we tested.
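To make the max-variance criterion concrete, the following is a minimal Python sketch of one way to select $m$ of $n$ rollouts in $O(n\log n)$ time. It is not the authors' implementation. It assumes that a variance-maximizing subset can always be formed from the $k$ largest and $m-k$ smallest rewards for some $k$, so that sorting once and scanning all $k$ with prefix sums suffices. The function name `max_variance_subset` and its signature are hypothetical.

```python
from typing import List, Sequence


def max_variance_subset(rewards: Sequence[float], m: int) -> List[int]:
    """Return indices of m rollouts whose rewards have maximal (population) variance.

    Assumption of this sketch: an optimal subset consists of the k highest
    and (m - k) lowest rewards for some k in 0..m.
    """
    n = len(rewards)
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")

    # Sort indices by reward, ascending: O(n log n).
    order = sorted(range(n), key=lambda i: rewards[i])

    # Prefix sums of rewards and squared rewards in sorted order.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for j, i in enumerate(order):
        s1[j + 1] = s1[j] + rewards[i]
        s2[j + 1] = s2[j] + rewards[i] ** 2

    best_k, best_var = 0, -1.0
    for k in range(m + 1):  # k = number of rollouts taken from the top
        low = m - k         # number of rollouts taken from the bottom
        total = s1[low] + (s1[n] - s1[n - k])
        total_sq = s2[low] + (s2[n] - s2[n - k])
        var = total_sq / m - (total / m) ** 2  # E[r^2] - (E[r])^2
        if var > best_var:
            best_k, best_var = k, var

    low = m - best_k
    return order[:low] + order[n - best_k:]
```

For example, with rewards `[0, 0, 0, 1, 1, 1, 0.5, 0.5]` and `m = 4`, the sketch keeps two rollouts with reward 0 and two with reward 1, discarding the intermediate rewards. After sorting, the scan over $k$ costs only $O(m)$, so the sort dominates the running time.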
