RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation

Chao Yu, Yuanqing Wang, Zhen Guo, Hao Lin, Si Xu, Hongzhi Zang, Quanlu Zhang, Yongji Wu, Chunyang Zhu, Junhao Hu, Zixiao Huang, Mingjie Wei, Yuqing Xie, Ke Yang, Bo Dai, Zhexuan Xu, Jiakun Du, Xiangyuan Wang, Xu Fu, Letong Shi, Zhihao Liu, Kang Chen, Weilin Liu, Gang Liu, Boxun Li, Jianlei Yang, Zhi Yang, Guohao Dai, Yu Wang

TL;DR

RLinf introduces macro-to-micro flow transformation (M2Flow) to decouple high-level RL workflow programming from low-level execution planning, enabling flexible and efficient large-scale RL training. It combines a worker abstraction, elastic pipelining, automatic context switching, adaptive communication, and a profiling-guided scheduler to automatically generate optimized execution plans. Across reasoning and embodied RL tasks, RLinf outperforms state-of-the-art systems, with end-to-end training throughput speedups of 1.1x-2.13x. The work demonstrates a path toward highly flexible AI runtimes by unifying heterogeneous RL components under a single execution framework.

Abstract

Reinforcement learning (RL) has demonstrated immense potential in advancing artificial general intelligence, agentic intelligence, and embodied intelligence. However, the inherent heterogeneity and dynamicity of RL workflows often lead to low hardware utilization and slow training on existing systems. In this paper, we present RLinf, a high-performance RL training system based on our key observation that the major roadblock to efficient RL training lies in system flexibility. To maximize flexibility and efficiency, RLinf is built atop a novel RL system design paradigm called macro-to-micro flow transformation (M2Flow), which automatically breaks down high-level, easy-to-compose RL workflows along both the temporal and spatial dimensions, and recomposes them into optimized execution flows. Supported by the adaptive communication capability of RLinf workers, we devise context switching and elastic pipelining to realize M2Flow transformation, and a profiling-guided scheduling policy to generate optimal execution plans. Extensive evaluations on both reasoning RL and embodied RL tasks demonstrate that RLinf consistently outperforms state-of-the-art systems, achieving 1.1x-2.13x speedup in end-to-end training throughput.

Paper Structure

This paper contains 19 sections, 1 equation, 13 figures, 7 tables, 1 algorithm.

Figures (13)

  • Figure 1: Diverse RL workflows in various scenarios.
  • Figure 2: The distribution of response lengths and the number of unfinished responses over time in the generation phase of a math RL experiment.
  • Figure 3: The execution time of generation and of the simulator at different batch sizes; for the simulator, the batch size is the number of environments.
  • Figure 4: The architecture of RLinf.
  • Figure 5: RLinf workflow programming interface.
  • ...and 8 more figures