ARL-Tangram: Unleash the Resource Efficiency in Agentic Reinforcement Learning

Bangjun Xiao; Yihao Zhao; Xiangwei Deng; Shihua Yu; Yuxing Xiang; Huaqiu Liu; Qiying Wang; Liang Zhao; Hailin Zhang; Xuanzhe Liu; Xin Jin; Fuli Luo

Abstract

Agentic reinforcement learning (RL) has emerged as a transformative workload in cloud clusters, enabling large language models (LLMs) to solve complex problems through interactions with the real world. However, unlike traditional RL, agentic RL demands substantial external cloud resources, e.g., CPUs for code execution and GPUs for reward models, that exist outside the primary training cluster. Existing agentic RL frameworks typically rely on static over-provisioning, i.e., resources are often tied to long-lived trajectories or isolated by task, which leads to severe resource inefficiency. We propose action-level orchestration and incorporate it into ARL-Tangram, a unified resource management system that enables fine-grained sharing and elasticity of external resources. ARL-Tangram uses a unified action-level formulation and an elastic scheduling algorithm to minimize action completion time (ACT) while satisfying heterogeneous resource constraints. In addition, heterogeneous resource managers are tailored to efficiently support action-level execution on resources with differing characteristics and topologies. Evaluation on real-world agentic RL tasks demonstrates that ARL-Tangram improves average ACT by up to 4.3$\times$, shortens the step duration of RL training by up to 1.5$\times$, and reduces external resource usage by up to 71.2%. The system has been deployed to support the training of the MiMo series models.

Paper Structure (23 sections, 4 equations, 9 figures, 1 table, 4 algorithms)

Introduction
Background
Agentic RL Training
External Resource Management
Over-Provisioning of External Resources
Opportunities and Challenges
Architecture
Unified Action-Level Formulation and Scheduling
Action Formulation
Elastic Resource Scheduling
Heterogeneous Resource Managers
Basic Resource Manager
CPU Manager via AOE
GPU Manager via EOE
Evaluation
...and 8 more sections

Figures (9)

Figure 1: Comparison of existing approaches and ARL-Tangram.
Figure 2: One training step of agentic RL.
Figure 3: (a) Average ACT under 1$\times$ and 0.5$\times$ external resource quantities. (b) SM activity of 12 different reward services in MOPD. (c) Code agent rollout duration ratio. (d) Number of external invocations in two agentic RL tasks.
Figure 4: System overview of ARL-Tangram.
Figure 5: Heterogeneous resource management.
...and 4 more figures
