Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony
Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng
TL;DR
ROLL Flash advances RL post-training by introducing fine-grained, rollout-train decoupled, asynchronous execution that significantly improves resource utilization and scalability without sacrificing performance. It combines queue scheduling, prompt replication, and environment-level asynchronous rollout with an adaptive AsyncRatio and off-policy algorithms (e.g., PPO, GRPO, TOPR, CISPO) to maintain stability. Theoretical bounds on generation and end-to-end times, alongside extensive experiments, show up to 2.24× throughput gains on RLVR and 2.72× on agentic tasks across large GPU pools, and near-maximal gains are achievable with only modest asynchrony. These results demonstrate that asynchronous RL post-training can deliver substantial efficiency improvements in both RLVR and agentic domains while preserving model quality, enabling scalable deployment in LLM-driven systems.
Abstract
Synchronous Reinforcement Learning (RL) post-training has emerged as a crucial step for enhancing Large Language Models (LLMs) with diverse capabilities. However, many systems designed to accelerate RL post-training still suffer from low resource utilization and limited scalability. We present ROLL Flash, a system that extends ROLL with native support for asynchronous RL post-training. ROLL Flash is built upon two core design principles: fine-grained parallelism and rollout-train decoupling. Guided by these principles, ROLL Flash provides flexible programming interfaces that enable a fully asynchronous training architecture and support efficient rollout mechanisms, including queue scheduling and environment-level asynchronous execution. Through comprehensive theoretical analysis and extensive experiments, we demonstrate that ROLL Flash significantly improves resource utilization and scalability over synchronous RL post-training. ROLL Flash achieves up to 2.24× speedup on RLVR tasks and 2.72× on agentic tasks, using the same GPU budget as synchronous baselines. Furthermore, we implement several popular off-policy algorithms and verify that asynchronous training can achieve performance on par with synchronous training.
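To make the idea of rollout-train decoupling concrete, the following is a minimal, deterministic toy simulation, not ROLL Flash's actual interface. It assumes that the asynchrony ratio can be read as a cap on how many policy versions a sample may lag behind the trainer, and it enforces that cap simply by limiting the shared buffer to (async_ratio + 1) × batch_size samples. The names `Sample`, `run_decoupled`, `async_ratio`, and `rollouts_per_tick` are hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    prompt_id: int
    policy_version: int  # policy version that generated this sample


def run_decoupled(num_steps, batch_size, async_ratio, rollouts_per_tick):
    """Toy rollout/train loop sharing a bounded FIFO buffer.

    Returns (training_steps_done, max_staleness_seen), where staleness is
    the trainer's version minus the version that generated a sample.
    """
    if batch_size < 1 or rollouts_per_tick < 1 or async_ratio < 0:
        raise ValueError("batch_size and rollouts_per_tick must be >= 1, async_ratio >= 0")

    buffer = deque()
    version = 0            # current policy version held by the trainer
    next_prompt = 0
    max_staleness = 0
    # Assumption of this sketch: capping the buffer bounds the version lag.
    capacity = (async_ratio + 1) * batch_size

    while version < num_steps:
        # Rollout side: keep generating with the latest weights until the
        # buffer is full, instead of waiting for the trainer to finish.
        for _ in range(rollouts_per_tick):
            if len(buffer) >= capacity:
                break
            buffer.append(Sample(next_prompt, version))
            next_prompt += 1

        # Train side: consume one batch as soon as enough samples exist.
        if len(buffer) >= batch_size:
            batch = [buffer.popleft() for _ in range(batch_size)]
            lag = max(version - s.policy_version for s in batch)
            max_staleness = max(max_staleness, lag)
            version += 1   # one optimizer step yields a new policy version

    return version, max_staleness
```

With `async_ratio=0` the buffer holds exactly one batch, so every sample is consumed by the version that produced it and the loop behaves like a synchronous pipeline. Larger values let generation run ahead of training, at the cost of samples that are up to `async_ratio` versions old, which is the regime where off-policy corrections become relevant.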
