MonkeyTree: Near-Minimal Congestion for Multi-tenant Training via Migration

Anton A. Zabreyko; Weiyang Wang; Manya Ghobadi

TL;DR

The paper presents MonkeyTree, the first system to mitigate network congestion in multi-tenant GPU clusters through job-migration-based defragmentation rather than network-layer techniques, and proves a tight bound showing that any placement can be defragmented to at most two cross-rack fragments per ToR.

Abstract

We present MonkeyTree, the first system to mitigate network congestion in multi-tenant GPU clusters through job-migration-based defragmentation rather than network-layer techniques. As cloud operators co-locate ML training jobs on shared, oversubscribed networks, congestion degrades training throughput for over a third of jobs. Prior approaches either rely on routing and flow scheduling, which we show have fundamental limits when traffic exceeds capacity, or require costly full-bisection-bandwidth topologies with packet spraying. MonkeyTree exploits characteristics of ML training traffic: ring-based collectives generate exactly one cross-rack flow per rack a job spans, making congestion-free placements achievable. The sparse constraint structure admits abundant valid configurations, making them easy to reach with few migrations. Once reached, low fragmentation is self-reinforcing, as new arrivals disturb only a few racks. MonkeyTree formulates defragmentation as an integer linear program that minimizes worker movements, subject to per-rack fragmentation bounds. We prove a tight bound showing that any placement can be defragmented to at most two cross-rack fragments per ToR, and extend the formulation to hybrid parallelism with multiple rings per server. Migration is implemented via in-memory checkpoint-and-restore over RDMA, incurring only 9.02 seconds of end-to-end system overhead per worker. We evaluate MonkeyTree using a custom simulator modeling clusters of up to 2,048 H200 GPUs and a prototype on a five-node A100 testbed. MonkeyTree improves average job completion time by 14 percent over the next best baseline on a cluster of 1,024 GPUs with 4:1 oversubscription. With a high 16:1 oversubscription ratio and 2,048 GPUs, MonkeyTree keeps p99 job completion time within 5 percent of ideal.
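
The claim that a ring-based job generates one cross-rack flow per rack it spans can be illustrated with a small sketch. The code below is not from the paper: the placement format (job name mapped to a per-rack worker count) and the function name `fragmentation_degree` are assumptions made for illustration. It counts, for each rack, how many jobs on that rack also have workers elsewhere, which under the abstract's traffic model is the number of units of cross-rack demand that rack's ToR carries.

```python
from collections import defaultdict


def fragmentation_degree(placement):
    """Count fragmented jobs per rack.

    placement: dict mapping job name -> dict mapping rack id -> worker count.
    This format is a hypothetical one chosen for the example.
    A job is "fragmented" if it has workers in more than one rack.
    """
    degree = defaultdict(int)
    for job, racks in placement.items():
        occupied = [rack for rack, count in racks.items() if count > 0]
        for rack in occupied:
            degree[rack] += 0  # make sure every occupied rack appears in the result
        if len(occupied) > 1:
            # A job spanning several racks adds one unit to each rack it touches.
            for rack in occupied:
                degree[rack] += 1
    return dict(degree)


# Jobs A and C span two racks each; job B sits entirely inside rack 1.
example = {
    "A": {0: 4, 1: 4},
    "B": {1: 2},
    "C": {1: 2, 2: 6},
}
print(fragmentation_degree(example))  # {0: 1, 1: 2, 2: 1}
```

In this example rack 1 hosts two fragmented jobs, so it sits exactly at the bound of two that the abstract states is always reachable; job B adds nothing because its ring never leaves the rack.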

Paper Structure (28 sections, 2 theorems, 1 equation, 14 figures, 3 tables)

Introduction
Motivation
Multi-Tenant Training and Congestion
Migration Challenges in CPU Workloads
Tractability of GPU Migration
Congestion-Free States are Achievable
Congestion-Free States Are Abundant
Congestion-Free States are Self-Reinforcing
MonkeyTree System Design
MonkeyTree Controller
MonkeyTree Daemons
MonkeyTree Formulation
Bounding Fragmentation Degree
MonkeyTree ILP for DP and FSDP Jobs
Extending to Model Parallel Jobs
...and 13 more sections

Key Result

Theorem 1

For any cluster and any set of pure DP or FSDP jobs running on it, there exists a job placement such that no rack has a fragmentation degree greater than two.
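
One way to see why such a placement can exist is to fill racks in order, one job after another. The sketch below is an illustration under simplifying assumptions, not the paper's proof or its ILP: it treats each rack as a flat pool of `rack_capacity` worker slots, ignores server boundaries and the cost of migrating from an existing placement, and uses hypothetical names (`pack_contiguously`, `max_fragmentation`). With this packing, a rack can hold at most one job that spills in from the previous rack and at most one that spills out to the next, so no rack hosts more than two fragmented jobs.

```python
def pack_contiguously(job_sizes, rack_capacity, num_racks):
    """Place jobs back to back, filling rack 0 first, then rack 1, and so on.

    job_sizes: dict mapping job name -> number of workers (hypothetical format).
    Returns a dict mapping job name -> dict mapping rack id -> worker count.
    """
    if sum(job_sizes.values()) > rack_capacity * num_racks:
        raise ValueError("jobs do not fit in the cluster")
    placement = {}
    rack, free = 0, rack_capacity
    for job, size in job_sizes.items():
        placement[job] = {}
        remaining = size
        while remaining > 0:
            if free == 0:
                # Current rack is full: continue the same job in the next rack.
                rack += 1
                free = rack_capacity
            take = min(remaining, free)
            placement[job][rack] = take
            remaining -= take
            free -= take
    return placement


def max_fragmentation(placement):
    """Largest number of multi-rack jobs that share any single rack."""
    degree = {}
    for racks in placement.values():
        if len(racks) > 1:
            for rack in racks:
                degree[rack] = degree.get(rack, 0) + 1
    return max(degree.values(), default=0)


sizes = {"A": 6, "B": 3, "C": 5, "D": 2}
placed = pack_contiguously(sizes, rack_capacity=4, num_racks=4)
print(placed)
print(max_fragmentation(placed))  # 2
```

Here jobs A, B, and C each straddle one rack boundary, so racks 1 and 2 each host two fragmented jobs, while job D fits inside rack 3 and contributes nothing.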

Figures (14)

Figure 1: A multitenant GPU cluster.
Figure 2: Migration and network congestion in shared CPU vs. GPU clusters. Different colors represent workers in different jobs; each ToR switch supports two units of demand. The CPU cluster (top) requires 11 migrations, yet inter-job congestion persists on racks 4 and 7. The GPU cluster (bottom) needs only 4 migrations to eliminate congestion.
Figure 3: Sample congestion-free states for GPU clusters, for jobs shown in Figure 2. Despite different placements, all states have two or fewer fragmented jobs, thereby generating at most two units of demand per rack.
Figure 4: Overview of MonkeyTree. A centralized controller and per-node daemons operate alongside existing schedulers.
Figure 5: MonkeyTree's ILP formulation for minimizing migration.
...and 9 more figures

Theorems & Definitions (2)

Theorem 1
Corollary 1
