Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library

Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, Zichen Liu, Haizhou Zhao, Dakai An, Lunxi Cao, Qiyang Cao, Wanxi Deng, Feilei Du, Yiliang Gu, Jiahe Li, Xiang Li, Mingjie Liu, Yijia Luo, Zihe Liu, Yadao Wang, Pei Wang, Tianyuan Wu, Yanan Wu, Yuheng Zhao, Shuaibing Zhao, Jin Yang, Siran Yang, Yingshui Tan, Huimin Yi, Yuchi Xu, Yujin Yuan, Xingyao Zhang, Lin Qu, Wenbo Su, Wei Wang, Jiamang Wang, Bo Zheng

TL;DR

ROLL is a scalable RL optimization library for large-scale learning. Built on Ray, it features a single-controller architecture, modular Parallel Workers, a Rollout Scheduler, specialized Environment and Reward Workers, and AutoDeviceMapping. It targets tech pioneers, product developers, and algorithm researchers, offering fine-grained control, flexible resource usage, and robust fault tolerance across large GPU clusters. Empirical results demonstrate substantial speedups and scalability on 200B+-parameter MoE training, with strong gains in RLVR and agentic tasks across math, code, Sokoban, FrozenLake, and WebShop domains, validating both performance and usability. The work also integrates Qwen3 with multimodal, multilingual capabilities and outlines a practical agentic RL pipeline, positioning ROLL as a comprehensive platform for scalable RL experimentation and deployment in LLM contexts.

Abstract

We introduce ROLL, an efficient, scalable, and user-friendly library designed for Reinforcement Learning Optimization for Large-scale Learning. ROLL caters to three primary user groups: tech pioneers aiming for cost-effective, fault-tolerant large-scale training, developers requiring flexible control over training workflows, and researchers seeking agile experimentation. ROLL is built upon several key modules to serve these user groups effectively. First, a single-controller architecture combined with an abstraction of the parallel worker simplifies the development of the training pipeline. Second, the parallel strategy and data transfer modules enable efficient and scalable training. Third, the rollout scheduler offers fine-grained management of each sample's lifecycle during the rollout stage. Fourth, the environment worker and reward worker support rapid and flexible experimentation with agentic RL algorithms and reward designs. Finally, AutoDeviceMapping allows users to assign resources to different models flexibly across various stages.
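The single-controller idea described in the abstract can be illustrated with a minimal sketch. This is not ROLL's API: the names `Worker`, `Cluster`, `dispatch`, and `device_mapping` are hypothetical, and plain Python threads stand in for the distributed runtime. It only shows the shape of the design, where one script maps roles to devices and drives each stage through a group of parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical parallel worker pinned to one device index."""
    role: str
    device: int

    def run(self, shard):
        # Stand-in for real work (generation, reward scoring, a training step).
        return [f"{self.role}@gpu{self.device}:{item}" for item in shard]


class Cluster:
    """Hypothetical worker group: one role, several workers, one call site."""

    def __init__(self, role, devices):
        self.workers = [Worker(role, d) for d in devices]

    def dispatch(self, batch):
        # Split the batch round-robin, run the shards in parallel, then gather.
        # Results come back grouped by worker, not in the input order.
        n = len(self.workers)
        shards = [batch[i::n] for i in range(n)]
        with ThreadPoolExecutor(max_workers=n) as pool:
            results = list(
                pool.map(lambda pair: pair[0].run(pair[1]), zip(self.workers, shards))
            )
        return [item for shard in results for item in shard]


# Hypothetical device mapping (an assumption for illustration): role -> device indices.
device_mapping = {"actor_infer": [0, 1], "reward": [2]}
clusters = {role: Cluster(role, devs) for role, devs in device_mapping.items()}

# The single controller drives each stage in order from one script.
prompts = ["p0", "p1", "p2", "p3"]
responses = clusters["actor_infer"].dispatch(prompts)
scored = clusters["reward"].dispatch(responses)
```

In this sketch, changing the resources assigned to a role means editing only the `device_mapping` dictionary; the controller logic stays the same.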

Paper Structure

This paper contains 49 sections and 7 figures.

Figures (7)

  • Figure 1: For three primary user groups, we introduce an efficient, scalable, and user-friendly library ROLL, which provides specific key features for large-scale RL optimization.
  • Figure 2: (a) The architecture of ROLL, which consists of the user input layer, a distributed executor & scheduler, an Auto Device Mapping module, and a resource pool. (b) The runtime setup and the training workflow of ROLL.
  • Figure 3: Accuracy Trends Across Different Tasks on Qwen2.5-7B-Base.
  • Figure 4: Accuracy Trends Across Different Tasks on Qwen3-30B-A3B-Base.
  • Figure 5: Performance metrics for the SimpleSokoban environment training. SuccessRate denotes the success rate of reaching the goal. EffectiveActionRate represents the proportion of valid actions executed.
  • ...and 2 more figures