Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony

Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng

TL;DR

ROLL Flash advances RL post-training by introducing fine-grained, rollout–train decoupled, asynchronous execution that significantly improves resource utilization and scalability without sacrificing performance. It combines queue scheduling, prompt replication, and environment-level asynchronous rollout with an adaptive AsyncRatio and off-policy algorithms (e.g., PPO, GRPO, TOPR, CISPO) to maintain stability. Theoretical bounds on generation and end-to-end times, alongside extensive experiments, show up to 2.24× throughput gains on RLVR and 2.72× on agentic tasks across large GPU pools, with near-maximal gains achievable with modest asynchrony. These results demonstrate that asynchronous RL post-training can deliver substantial efficiency improvements in both RLVR and agentic domains while preserving model quality, enabling scalable deployment in large LLM–driven systems.

Abstract

Synchronous Reinforcement Learning (RL) post-training has emerged as a crucial step for enhancing Large Language Models (LLMs) with diverse capabilities. However, many systems designed to accelerate RL post-training still suffer from low resource utilization and limited scalability. We present ROLL Flash, a system that extends ROLL with native support for asynchronous RL post-training. ROLL Flash is built upon two core design principles: fine-grained parallelism and rollout-train decoupling. Guided by these principles, ROLL Flash provides flexible programming interfaces that enable a fully asynchronous training architecture and support efficient rollout mechanisms, including queue scheduling and environment-level asynchronous execution. Through comprehensive theoretical analysis and extensive experiments, we demonstrate that ROLL Flash significantly improves resource utilization and scalability over synchronous RL post-training. ROLL Flash achieves up to 2.24x speedup on RLVR tasks and 2.72x on agentic tasks, using the same GPU budget as synchronous baselines. Furthermore, we implement several popular off-policy algorithms and verify that asynchronous training can achieve performance on par with synchronous training.

Paper Structure

This paper contains 32 sections, 2 theorems, 9 equations, 11 figures, 1 table.

Key Result

Proposition 1

(Generation Time Bound) Let there be $K$ workers executing under queue scheduling (a new task is assigned as soon as a worker finishes). Suppose $Q$ samples need to be generated, where the generation time of each sample lies in $[0, L_{\text{gen}}]$ with mean $\mu_{\text{gen}}$. The [...] Consequently, the average completion time per sample is bounded by: [...] As $\alpha \to \infty$, the pe
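The queue-scheduling setup in the proposition can be explored with a small simulation. The sketch below is illustrative only: it checks the classical greedy list-scheduling bound (makespan at most the total work divided by $K$, plus the longest single task), which may differ from the exact bound stated in the paper. The function name `queue_schedule_makespan`, the truncated exponential duration distribution, and the values of `K`, `Q`, and `L_gen` are all assumptions made for the example, and the empirical mean `mu_hat` stands in for $\mu_{\text{gen}}$.

```python
import heapq
import random


def queue_schedule_makespan(durations, num_workers):
    """Simulate queue scheduling: each task is handed to whichever
    worker becomes free first. Returns the time the last task ends."""
    # Min-heap of the times at which each worker next becomes free.
    free_at = [0.0] * num_workers
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest available worker
        heapq.heappush(free_at, start + d)  # busy until start + d
    return max(free_at)


# Hypothetical workload: Q samples whose generation times lie in [0, L_gen].
random.seed(0)
K, Q, L_gen = 8, 256, 10.0
durations = [min(L_gen, random.expovariate(1.0)) for _ in range(Q)]
mu_hat = sum(durations) / Q  # empirical mean generation time

makespan = queue_schedule_makespan(durations, K)
lower_bound = Q * mu_hat / K          # perfect load balance
upper_bound = Q * mu_hat / K + L_gen  # classical greedy-scheduling bound

print(f"makespan={makespan:.2f}, bounds=[{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"average time per sample <= {upper_bound / Q:.4f}")
```

Dividing the upper bound by `Q` gives a per-sample figure of `mu_hat / K + L_gen / Q`. This suggests why the long-tail term matters less as more samples share the same worker pool.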

Figures (11)

  • Figure 1: (a) Vanilla synchronous training alongside several optimizations introduced by ROLL Flash: queue scheduling (\ref{sec:Queue_Scheduling}), prompt replication (\ref{sec:prompt_replication}), and an asynchronous architecture (\ref{sec:framework}). (b) Throughput of the training architectures illustrated in (a) as the number of GPUs increases, on the Qwen3-8B-Base and Think models. In the top panel of \ref{fig:Throughput}, the asynchronous approach achieves higher efficiency and scales strongly with GPU count, delivering $2.12\times$ the throughput of the synchronous architecture on 128 GPUs. In the bottom panel of \ref{fig:Throughput}, all methods scale poorly at low average sequence lengths. Nevertheless, the asynchronous approach mitigates the impact of long-tail rollouts and is significantly more efficient than the synchronous approach ($1.53\times$ to $2.24\times$ faster). More detailed experiments and analyses can be found in \ref{sec:why_async}.
  • Figure 2: An illustration of Training Acceleration with ROLL Flash.
  • Figure 3: Efficiency comparison of Async and Sync under different rollout batch sizes and training-inference resource ratios. (a) Given a fixed GPU resource budget, optimal efficiency can be achieved by tuning the allocation ratio between training and inference. (b) Efficiency scaling curves of Async and ROLL-Sync. Async exhibits a clear advantage in almost all cases.
  • Figure 4: Off-Policy Algorithm Performance Comparison under Async Ratio 2 and 8. For readability in the qualitative analysis, all curves are smoothed using identical parameters. Specifically, the mean values are computed using an 11-step moving window. The shaded regions around the curves represent the range mean $\pm$ (std_multiplier $\times$ standard deviation), providing a visual representation of the oscillation amplitude. The Sync baseline uses the performance at 400 steps.
  • Figure 5: Asynchronous Execution Workflow of ROLL Flash for RLVR and Agentic Post-Training. It consists of LLMProxy, EnvManagers, SampleBuffer, and AsyncController, which together orchestrate an asynchronous training workflow with fine-grained parallelism.
  • ...and 6 more figures
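
The smoothing described in the Figure 4 caption (an 11-step moving window with a shaded band of mean $\pm$ std_multiplier $\times$ standard deviation) can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the caption does not say whether the window is centered or trailing, how edges are handled, or what value `std_multiplier` takes, so the centered window, the truncation at the series edges, the population standard deviation, and the default `std_multiplier=1.0` are all assumptions. The function name `smooth_with_band` is hypothetical.

```python
import statistics


def smooth_with_band(values, window=11, std_multiplier=1.0):
    """Return (means, lower, upper) for a moving-window smoothing.

    Assumes a centered window that is truncated at the series edges
    and a population standard deviation within each window.
    """
    half = window // 2
    means, lower, upper = [], [], []
    for i in range(len(values)):
        # Slice of the series covered by the window around step i.
        chunk = values[max(0, i - half): i + half + 1]
        m = statistics.fmean(chunk)
        s = statistics.pstdev(chunk)
        means.append(m)
        lower.append(m - std_multiplier * s)
        upper.append(m + std_multiplier * s)
    return means, lower, upper


# Hypothetical metric curve: a linear trend with an alternating oscillation.
curve = [0.01 * step + (0.05 if step % 2 else -0.05) for step in range(50)]
means, lower, upper = smooth_with_band(curve)
print(means[25], lower[25], upper[25])
```

The band width at each step reflects how strongly the raw curve oscillates inside the window, which is the quantity the shaded regions are meant to convey.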

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2