Rollout-Training Co-Design for Efficient LLM-Based Multi-Agent Reinforcement Learning

Zhida Jiang; Zhaolong Xing; Jiawei Lu; Yipei Niu; Qingyuan Sang; Liangxu Zhang; Wenquan Dai; Junhua Shu; Jiaxing Wang; Qiangyu Pei; Qiong Chen; Xinyu Liu; Fangming Liu; Ai Han; Zhen Chen; Ke Zhang

TL;DR

FlexMARL addresses core system-level inefficiencies in large-scale LLM-based MARL by co-designing rollout, training, and orchestration. Its disaggregated architecture, experience store, and fine-grained asynchronous pipeline enable parallel rollout, on-demand training, and strong consistency, yielding up to 7.3x speedups and substantially higher hardware utilization in industrial workloads. The results demonstrate scalable performance on a 48-node cluster with heterogeneous agent sizes, verified against multiple baselines. The work provides a practical blueprint for deploying large-scale MARL in production environments and highlights important design considerations for accelerator-driven, multi-agent systems.

Abstract

Despite algorithm-level innovations for multi-agent reinforcement learning (MARL), the underlying networked infrastructure for large-scale MARL training remains underexplored. Existing training frameworks primarily optimize for single-agent scenarios and fail to address the unique system-level challenges of MARL, including rollout-training synchronization barriers, rollout load imbalance, and training resource underutilization. To bridge this gap, we propose FlexMARL, the first end-to-end training framework that holistically optimizes rollout, training, and their orchestration for large-scale LLM-based MARL. Specifically, FlexMARL introduces a joint orchestrator to manage data flow under a rollout-training disaggregated architecture. Building on an experience store, a novel micro-batch-driven asynchronous pipeline eliminates the synchronization barriers while providing strong consistency guarantees. The rollout engine adopts a parallel sampling scheme combined with hierarchical load balancing, which adapts to skewed inter-agent and intra-agent request patterns. The training engine achieves on-demand hardware binding through agent-centric resource allocation. The training states of different agents are swapped via unified, location-agnostic communication. Empirical results on a large-scale production cluster demonstrate that FlexMARL achieves up to 7.3x speedup and improves hardware utilization by up to 5.6x compared to existing frameworks.
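
To make the micro-batch-driven idea concrete, the following is a minimal sketch of an experience store that hands samples to a trainer as soon as one micro-batch is ready, rather than waiting for the full rollout batch. It is not FlexMARL's implementation: the class name `ExperienceStore`, its methods, the agent name `planner`, and the rule of discarding samples from older policy versions are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class ExperienceStore:
    """Toy in-memory buffer keyed by agent (hypothetical, not FlexMARL's API)."""

    def __init__(self, micro_batch_size):
        self.micro_batch_size = micro_batch_size
        self._queues = defaultdict(deque)

    def put(self, agent_id, policy_version, sample):
        # Rollout workers tag each sample with the policy version that produced it.
        self._queues[agent_id].append((policy_version, sample))

    def pop_micro_batch(self, agent_id, current_version):
        """Return one micro-batch for agent_id, or None if too few samples are ready."""
        # Assumption of this sketch: keep only samples from the current policy
        # version, as one simple way to avoid training on stale data.
        fresh = deque(
            item for item in self._queues[agent_id] if item[0] == current_version
        )
        self._queues[agent_id] = fresh
        if len(fresh) < self.micro_batch_size:
            return None  # the trainer can poll again later without blocking rollout
        return [fresh.popleft()[1] for _ in range(self.micro_batch_size)]


store = ExperienceStore(micro_batch_size=2)
store.put("planner", 1, "traj-a")
store.put("planner", 1, "traj-b")
batch = store.pop_micro_batch("planner", current_version=1)
```

In this sketch each agent has its own queue, so a trainer for one agent can start consuming micro-batches while rollout for other agents is still in progress.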
