MonkeyTree: Near-Minimal Congestion for Multi-tenant Training via Migration

Anton A. Zabreyko, Weiyang Wang, Manya Ghobadi

TL;DR

The paper presents MonkeyTree, the first system to mitigate network congestion in multi-tenant GPU clusters through job-migration-based defragmentation rather than network-layer techniques, and proves a tight bound showing that any placement can be defragmented to at most two cross-rack fragments per ToR.

Abstract

We present MonkeyTree, the first system to mitigate network congestion in multi-tenant GPU clusters through job-migration-based defragmentation rather than network-layer techniques. As cloud operators co-locate ML training jobs on shared, oversubscribed networks, congestion degrades training throughput for over a third of jobs. Prior approaches either rely on routing and flow scheduling, which we show have fundamental limits when traffic exceeds capacity, or require costly full-bisection bandwidth topologies with packet spraying. MonkeyTree exploits characteristics of ML training traffic: ring-based collectives generate exactly one cross-rack flow per rack a job spans, making congestion-free placements achievable. The sparse constraint structure admits abundant valid configurations, making them easy to reach with few migrations. Once reached, low fragmentation is self-reinforcing, as new arrivals disturb only a few racks. MonkeyTree formulates defragmentation as an integer linear program that minimizes worker movements, subject to per-rack fragmentation bounds. We prove a tight bound showing any placement can be defragmented to at most two cross-rack fragments per ToR, and extend the formulation to hybrid parallelism with multiple rings per server. Migration is implemented via in-memory checkpoint-and-restore over RDMA, incurring only 9.02 seconds of system overhead end-to-end per worker. We evaluate MonkeyTree using a custom simulator modeling clusters of up to 2,048 H200 GPUs and a prototype on a five-node A100 testbed. MonkeyTree improves average job completion time by 14 percent over the next best baseline on a cluster of 1,024 GPUs with 4:1 oversubscription. With a high 16:1 oversubscription ratio and 2,048 GPUs, MonkeyTree keeps p99 job completion time within 5 percent of ideal.
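
The abstract's claim that a ring-based collective produces one cross-rack flow per rack a job spans can be illustrated with a minimal sketch. It assumes the ring visits workers grouped by rack, and it counts only outgoing ring edges. The function name `cross_rack_flows` and the list-of-rack-ids input format are hypothetical and not taken from the paper.

```python
from collections import Counter


def cross_rack_flows(worker_racks):
    """Count outgoing cross-rack ring edges per rack for a single job.

    worker_racks: rack id of each worker in the job (hypothetical format).
    Assumption: the ring is ordered so that workers in the same rack are
    adjacent, which sorting by rack id achieves.
    """
    ring = sorted(worker_racks)
    flows = Counter()
    n = len(ring)
    for i in range(n):
        src, dst = ring[i], ring[(i + 1) % n]  # ring wraps around at the end
        if src != dst:
            flows[src] += 1  # this edge leaves rack `src`
    return dict(flows)


# A job with workers in racks 0, 3 and 5, and a job confined to one rack.
spanning_job = cross_rack_flows([0, 0, 0, 3, 3, 5])
local_job = cross_rack_flows([2, 2, 2])
print(spanning_job, local_job)
```

Under this ordering, the number of workers a job has in a rack does not change that rack's outgoing flow count; only whether the job extends beyond the rack matters.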

Paper Structure (28 sections, 2 theorems, 1 equation, 14 figures, 3 tables)

Key Result

Theorem 1

For any cluster and any set of pure DP or FSDP jobs running on it, there exists a job placement such that no rack has a fragmentation degree greater than two.
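
One way to see why a bound of two is plausible is to pack jobs one after another into consecutive rack slots. The sketch below is an illustration only, not the paper's proof or its ILP. It assumes that a rack's fragmentation degree is the number of jobs in that rack that also have workers in another rack, that all racks have the same capacity, and that the cluster has enough racks for every job. The function names and the job sizes are hypothetical.

```python
from collections import defaultdict


def pack_sequentially(job_sizes, rack_capacity):
    """Place jobs back to back into consecutive rack slots.

    Returns {rack: {job: workers_in_that_rack}}.
    Assumption: the cluster has enough racks to hold every job.
    """
    placement = defaultdict(dict)
    slot = 0
    for job, size in job_sizes.items():
        for _ in range(size):
            rack = slot // rack_capacity
            placement[rack][job] = placement[rack].get(job, 0) + 1
            slot += 1
    return dict(placement)


def fragmentation_degree(placement):
    """Per rack, count the jobs that also have workers in another rack."""
    racks_of = defaultdict(set)
    for rack, jobs in placement.items():
        for job in jobs:
            racks_of[job].add(rack)
    return {
        rack: sum(1 for job in jobs if len(racks_of[job]) > 1)
        for rack, jobs in placement.items()
    }


# Hypothetical workload: five jobs on racks of eight GPUs each.
jobs = {"A": 3, "B": 6, "C": 2, "D": 9, "E": 4}
placement = pack_sequentially(jobs, rack_capacity=8)
degrees = fragmentation_degree(placement)
print(degrees)
```

With back-to-back packing, a rack can hold at most one job spilling in from the previous rack and one spilling out to the next, and every other job in it fits entirely inside. This sketch ignores the cost of moving workers, which is what MonkeyTree's ILP minimizes.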

Figures (14)

  • Figure 1: A multitenant GPU cluster.
  • Figure 2: Migration and network congestion in shared CPU vs. GPU clusters. Different colors represent workers in different jobs; each ToR switch supports two units of demand. The CPU cluster (top) requires 11 migrations, yet inter-job congestion persists on racks 4 and 7. The GPU cluster (bottom) needs only 4 migrations to eliminate congestion.
  • Figure 3: Sample congestion-free states for GPU clusters, for jobs shown in Figure 2. Despite different placements, all states have two or fewer fragmented jobs, thereby generating at most two units of demand per rack.
  • Figure 4: Overview of MonkeyTree. A centralized controller and per-node daemons operate alongside existing schedulers.
  • Figure 5: MonkeyTree's ILP formulation for minimizing migration.
  • ...and 9 more figures

Theorems & Definitions (2)

  • Theorem 1
  • Corollary 1