SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang, Yifei Shen, Dongsheng Li, Yuqing Yang, Lili Qiu, Yang You

Abstract

Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy designed to address this bottleneck by improving rollout efficiency while maintaining training stability. SortedRL reorders rollout samples by output length, prioritizing short samples and grouping them for early updates. This simultaneously enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction. To further accelerate the pipeline, SortedRL uses a cache-based mechanism to control the degree of off-policy training, and is supported by a dedicated RL infrastructure that manages rollout and update via a stateful controller and rollout buffer. Experiments using LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logical puzzles and math challenges such as AIME 24, Math 500, and Minerva, show that SortedRL reduces RL training bubble ratios by over 50% while attaining 3.9% to 18.4% better performance than the baseline given the same amount of data.
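
The reordering idea in the abstract can be illustrated with a minimal sketch. It is an offline simplification, not the authors' implementation: it assumes all rollouts have already finished, whereas SortedRL schedules online as generations complete. The names `Rollout` and `length_sorted_update_batches` and the sample lengths are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    prompt_id: int
    output_len: int  # number of generated tokens (illustrative values below)


def length_sorted_update_batches(
    rollouts: List[Rollout], update_batch_size: int
) -> List[List[Rollout]]:
    """Sort rollouts by output length and cut them into update batches.

    Assumption: every rollout is already complete. Shorter samples land in
    earlier batches, so they would be used for earlier policy updates.
    """
    ordered = sorted(rollouts, key=lambda r: r.output_len)
    return [
        ordered[i:i + update_batch_size]
        for i in range(0, len(ordered), update_batch_size)
    ]


# One large rollout batch of 8 samples, split into update batches of 4.
lengths = [900, 120, 4000, 300, 150, 2500, 700, 60]
rollouts = [Rollout(i, n) for i, n in enumerate(lengths)]
batches = length_sorted_update_batches(rollouts, update_batch_size=4)
batch_lengths = [[r.output_len for r in batch] for batch in batches]
print(batch_lengths)
```

In this toy run the first update batch contains only the four shortest samples, which is the sense in which short samples are prioritized for early updates.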

Paper Structure (35 sections, 4 equations, 9 figures, 1 table)

This paper contains 35 sections, 4 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: (a) Latency breakdown of RL training for LLMs. (b) GPU wall time per rollout batch with a batch size of 128. (c) Length distribution of sampled trajectories during rollout. Visualizations are based on DeepSeek-R1-Distill-Llama-8B (Guo et al., 2025) with a 4K maximum generation length.
  • Figure 2: The SortedRL framework. (a) The architecture of the SortedRL engine, consisting of two core modules: a length-aware controller and a stateful rollout buffer. The RL training pipeline includes the following key steps: 1) concatenate the buffer and feed prompts, 2) early termination, 3) collect and update rollout trajectories, and 4) sort and feed training batches. (b, c) Illustrative timeline and SortedRL strategy. Samples in the same batch are shown in the same color. Dotted lines and boxes indicate the harvest timing. In fully on-policy mode, the gray bars are partially discarded incomplete samples or non-scheduled prompts, while in partial mode no trajectories are discarded.
  • Figure 3: LogicRL overall results. On-Policy and Partial are variants of SortedRL.
  • Figure 4: Mathematical task overall results. On-Policy and Partial are variants of SortedRL.
  • Figure 5: Rollout throughput under different strategies.
  • ...and 4 more figures
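
The steps listed in the Figure 2 caption (feed prompts from the buffer, terminate early, collect trajectories, sort and emit a training batch) can be sketched as a small stateful buffer. This is a toy model under stated assumptions, not the paper's infrastructure: `Trajectory`, `RolloutBuffer`, `keep_partial`, and `demo` are hypothetical names, generation is simulated by hand, and "discarding" in on-policy mode is modeled as clearing the tokens of unfinished samples so they restart later.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Trajectory:
    prompt_id: int
    tokens: List[int] = field(default_factory=list)
    done: bool = False


class RolloutBuffer:
    """Hypothetical stateful buffer holding unfinished trajectories."""

    def __init__(self, keep_partial: bool):
        # keep_partial=True models "partial" mode; False models on-policy mode.
        self.keep_partial = keep_partial
        self.pending: Dict[int, Trajectory] = {}

    def feed(self, new_prompt_ids: List[int]) -> List[Trajectory]:
        # Step 1: concatenate buffered trajectories with newly scheduled prompts.
        for pid in new_prompt_ids:
            self.pending.setdefault(pid, Trajectory(pid))
        return list(self.pending.values())

    def harvest(self) -> List[Trajectory]:
        # Steps 2-4: stop generation, collect finished samples, sort by length.
        finished = sorted(
            (t for t in self.pending.values() if t.done),
            key=lambda t: len(t.tokens),
        )
        for t in finished:
            del self.pending[t.prompt_id]
        if not self.keep_partial:
            # Assumed on-policy behavior: drop tokens generated so far.
            for t in self.pending.values():
                t.tokens.clear()
        return finished


def demo(keep_partial: bool) -> Tuple[List[int], int]:
    buf = RolloutBuffer(keep_partial)
    t0, t1, t2 = buf.feed([0, 1, 2])
    # Simulated generation: prompts 0 and 1 finish, prompt 2 is still running.
    t0.tokens, t0.done = [1, 2, 3], True
    t1.tokens, t1.done = [1], True
    t2.tokens = [1, 2, 3, 4, 5]
    batch = buf.harvest()
    return [t.prompt_id for t in batch], len(buf.pending[2].tokens)


print(demo(keep_partial=True))
print(demo(keep_partial=False))
```

In both modes the harvested batch is ordered shortest first; the modes differ only in whether the unfinished trajectory keeps its generated tokens for the next round.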