Train Less, Learn More: Adaptive Efficient Rollout Optimization for Group-Based Reinforcement Learning

Zhi Zhang, Zhen Han, Costas Mavromatis, Qi Zhu, Yunyi Zhang, Sheng Guan, Dingmin Wang, Xiong Zhou, Shuai Wang, Soji Adeshina, Vassilis Ioannidis, Huzefa Rangwala

TL;DR

Adaptive Efficient Rollout Optimization (AERO), an enhancement of GRPO, matches or improves Pass@8 and Avg@8 over GRPO, demonstrating a practical, scalable, and compute-efficient strategy for RL-based LLM alignment.

Abstract

Reinforcement learning (RL) plays a central role in large language model (LLM) post-training. Among existing approaches, Group Relative Policy Optimization (GRPO) is widely used, especially for RL with verifiable rewards (RLVR) fine-tuning. In GRPO, each query prompts the LLM to generate a group of rollouts with a fixed group size $N$. When all rollouts in a group share the same outcome, either all correct or all incorrect, the group-normalized advantages become zero, yielding no gradient signal and wasting fine-tuning compute. We introduce Adaptive Efficient Rollout Optimization (AERO), an enhancement of GRPO. AERO uses an adaptive rollout strategy, applies selective rejection to strategically prune rollouts, and maintains a Bayesian posterior to prevent zero-advantage dead zones. Across three model configurations (Qwen2.5-Math-1.5B, Qwen2.5-7B, and Qwen2.5-7B-Instruct), AERO improves compute efficiency without sacrificing performance. Under the same total rollout budget, AERO reduces total training compute by about 48% while shortening wall-clock time per step by about 45% on average. Despite the substantial reduction in compute, AERO matches or improves Pass@8 and Avg@8 over GRPO, demonstrating a practical, scalable, and compute-efficient strategy for RL-based LLM alignment.
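
The zero-advantage case described above can be illustrated with a minimal sketch. It assumes the common GRPO-style normalization, in which each reward is standardized by the mean and standard deviation of its group; the function name `group_normalized_advantages` and the stabilizer `eps` are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style sketch).

    `eps` is an assumed numerical stabilizer that avoids division by zero
    when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group (2 correct, 6 incorrect) yields non-zero advantages.
mixed = group_normalized_advantages([1, 0, 0, 1, 0, 0, 0, 0])

# Uniform groups yield all-zero advantages, so they contribute no gradient.
all_wrong = group_normalized_advantages([0] * 8)
all_right = group_normalized_advantages([1] * 8)
```

In this sketch, every rollout spent on a uniform group produces an advantage of exactly zero, which is the wasted compute that motivates adaptive rollout allocation.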

Paper Structure (54 sections, 4 theorems, 38 equations, 12 figures, 9 tables, 1 algorithm)

Key Result

Theorem 4.1

Let a training subset contain $n = c + m$ rollouts, with $c$ correct rollouts ($r_i = 1$) forming set $F_C$ and $m$ incorrect rollouts ($r_i = 0$) forming set $F_S$. Under the simplifying assumptions of uniform rollout length $L$ and zero KL penalty ($\beta = 0$), the group-relative policy gradient where $\rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\text{old}}(o_{i,t}\mid q,o_{i,<t})}$ ...

Figures (12)

  • Figure 1: Zero-accuracy problems in GRPO-style training and inference: (a) zero-accuracy problem ratio during GRPO training; (b) proportion of zero-accuracy problems during rollout inference across model sizes, averaged over five benchmarks ($n=8$). Light bars show the zero-accuracy ratio $p_0$ (%, left y-axis). Dark bars show the compute inflation factor (right y-axis), defined as $1/(1-p_0)$, i.e., the expected oversampling multiplier needed to obtain the same number of non-zero-accuracy queries. Text annotations (e.g., $1.28\times$) report the inflation factor.
  • Figure 2: Correlation between zero-accuracy ratio and relative model performance ($r=-0.86, p<0.001$).
  • Figure 3: High-level architecture of the Adaptive Efficient Rollout Optimization (AERO) framework. AERO begins with $n_{\text{total}} = 16$ total rollouts per query, from which $n_{\text{explore}} = 8$ samples are used for stratified exploration based on success rate $u$. Finally, AERO yields an effective training size of approximately $n_{\text{training}}=4.6$ rollouts per query.
  • Figure 4: Rescued problems across iterations.
  • Figure 5: Comparison of per-step (a) wall-clock latency and (b) computational cost (FLOPs), decomposed into Data Generation (Rollout), Model Update (Training), and Total (Rollout + Training).
  • ...and 7 more figures
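
The compute inflation factor defined in the Figure 1 caption, $1/(1-p_0)$, can be worked through with a short sketch. The function names below are illustrative assumptions; only the formula comes from the caption.

```python
def compute_inflation_factor(p0):
    """Expected oversampling multiplier 1 / (1 - p0) for zero-accuracy ratio p0."""
    if not 0.0 <= p0 < 1.0:
        raise ValueError("p0 must lie in [0, 1)")
    return 1.0 / (1.0 - p0)


def zero_ratio_from_inflation(factor):
    """Invert the formula: recover p0 from a reported inflation factor."""
    if factor < 1.0:
        raise ValueError("inflation factor must be at least 1")
    return 1.0 - 1.0 / factor


# If a quarter of the queries are zero-accuracy, about 1.33x as many
# queries must be sampled to keep the same number of useful ones.
factor_at_quarter = compute_inflation_factor(0.25)

# A 1.28x annotation corresponds to a zero-accuracy ratio of 0.21875 (about 21.9%).
p0_at_1_28 = zero_ratio_from_inflation(1.28)
```

Read this way, the example annotation of $1.28\times$ in the caption means that roughly one query in five produced no correct rollout.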

Theorems & Definitions (7)

  • Theorem 4.1: GRPO Gradient Norm
  • Lemma 4.2: Maximization of the GRPO Gradient Norm
  • proof
  • Theorem 1.1: GRPO Gradient Norm
  • proof
  • Lemma 1.2: Maximization of the GRPO Gradient Norm
  • proof