StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation

Yinmin Zhong; Zili Zhang; Xiaoniu Song; Hanpeng Hu; Chao Jin; Bingyang Wu; Nuo Chen; Yukun Chen; Yu Zhou; Changyi Wan; Hongyu Zhou; Yimin Jiang; Yibo Zhu; Daxin Jiang

TL;DR

StreamRL revisits disaggregated RL for LLM post-training and targets its two main inefficiencies: pipeline bubbles and skewness bubbles. It introduces a streaming two-stage design with a Stream Generation Service and a Trainer, backed by a profiler-based resource allocator and a skewness-aware scheduling pipeline that uses an output-length ranker. In end-to-end experiments on 7B–72B Qwen models, it improves throughput by up to 2.66x over state-of-the-art baselines, and it improves cost-effectiveness by up to 1.33x in a heterogeneous, cross-datacenter setting. The work demonstrates that disaggregation, when coupled with streaming and data-tailored scheduling, makes RL for large-scale LLMs practical and scalable across heterogeneous hardware and networks.

Abstract

Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). RL for LLMs involves two stages: generation and training. The LLM first generates samples online, which are then used to derive rewards for training. The conventional view holds that the colocated architecture, where the two stages share resources via temporal multiplexing, outperforms the disaggregated architecture, in which dedicated resources are assigned to each stage. However, in real-world deployments, we observe that the colocated architecture suffers from resource coupling, where the two stages are constrained to use the same resources. This coupling compromises the scalability and cost-efficiency of colocated RL in large-scale training. In contrast, the disaggregated architecture allows for flexible resource allocation, supports heterogeneous training setups, and facilitates cross-datacenter deployment. StreamRL is designed with disaggregation from first principles and fully unlocks its potential by addressing two types of performance bottlenecks in existing disaggregated RL frameworks: pipeline bubbles, caused by stage dependencies, and skewness bubbles, resulting from long-tail output length distributions. To address pipeline bubbles, StreamRL breaks the traditional stage boundary in synchronous RL algorithms through stream generation and achieves full overlapping in asynchronous RL. To address skewness bubbles, StreamRL employs an output-length ranker model to identify long-tail samples and reduces generation time via skewness-aware dispatching and scheduling. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems, and improves cost-effectiveness by up to 1.33x in a heterogeneous, cross-datacenter setting.
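To make the pipeline-bubble argument concrete, the toy timeline below compares a strict stage boundary (the trainer waits for every sample) with streamed hand-off (the trainer consumes each micro-batch as soon as its samples are generated). This is only an illustrative sketch, not StreamRL's implementation: the function names, the fixed per-micro-batch training time, and the assumption that the trainer can process micro-batches incrementally (for example by accumulating gradients before a single synchronous update) are all hypothetical.

```python
from typing import List


def batch_makespan(finish_times: List[float], micro_batch: int, train_time: float) -> float:
    """Strict stage boundary: training starts only after the last sample is generated."""
    n_batches = -(-len(finish_times) // micro_batch)  # ceiling division
    return max(finish_times) + n_batches * train_time


def stream_makespan(finish_times: List[float], micro_batch: int, train_time: float) -> float:
    """Streamed hand-off: each micro-batch is trained as soon as its samples have arrived."""
    done = sorted(finish_times)  # samples reach the trainer in completion order
    t = 0.0  # time at which the trainer becomes free
    for start in range(0, len(done), micro_batch):
        last = min(start + micro_batch, len(done)) - 1
        ready = done[last]  # arrival time of the micro-batch's last sample
        t = max(t, ready) + train_time
    return t


# Hypothetical generation finish times (arbitrary units) with one long-tail sample.
finish_times = [1, 2, 2, 3, 4, 5, 9, 20]
print(batch_makespan(finish_times, micro_batch=2, train_time=3))   # 32
print(stream_makespan(finish_times, micro_batch=2, train_time=3))  # 23
```

In this toy run, streaming hides the training time of the first three micro-batches behind generation, but the step still ends only after the single long-tail sample (finishing at time 20) is generated and trained on. That remaining wait corresponds to the skewness bubble that the output-length ranker and skewness-aware dispatching are meant to shrink.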
