Dynamic Hierarchical Birkhoff-von Neumann Decomposition for All-to-All GPU Communication

Yen-Chieh Wu; Cheng-Shang Chang; Duan-Shin Lee; H. Jonathan Chao

TL;DR

The paper proposes a dynamic hierarchical Birkhoff--von Neumann (BvN) decomposition framework tailored to two-tier GPU fabrics. The result is an online scheduler with provable stability under admissible Poisson arrivals and, in simulations, substantial reductions in mean frame length.

Abstract

All-to-all GPU communication is a critical bottleneck in large-scale training clusters, where completion time is constrained by per-port bandwidth and can be severely impacted by traffic skew across GPUs and network interface cards (NICs). This issue is amplified by the two-tier structure of modern GPU systems, which combine fast intra-server links with much slower inter-server networks. Motivated by recent system observations that highlight the importance of traffic reshaping and hierarchy awareness, we study all-to-all scheduling from an online switching and queueing-theoretic perspective. We propose a dynamic hierarchical Birkhoff--von Neumann (BvN) decomposition framework tailored to two-tier GPU fabrics. At each frame boundary, traffic is first balanced within each server using simple local operations to mitigate micro-level GPU/NIC skew while preserving aggregate server-to-server demand. A hierarchical BvN decomposition is then applied at the server level and refined into GPU-level matchings, significantly reducing decomposition complexity relative to a flat GPU-level approach. By integrating this construction with the dynamic frame sizing (DFS) principle, we obtain an online scheduler with provable stability under admissible Poisson arrivals. Simulations demonstrate substantial reductions in mean frame length, particularly under server-localized hotspot traffic.
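To make the balancing step more concrete, here is a minimal Python sketch of what the abstract describes: reshaping traffic inside a server lowers the worst per-port load while leaving the aggregate server-to-server demand unchanged. The function names, the 2-server by 2-GPU layout, and both demand matrices are hypothetical. The balanced matrix is built by hand and is not the output of the paper's balancing algorithm. The sketch also assumes each port moves at most one unit per slot.

```python
def server_aggregate(demand, gpus_per_server):
    """Sum a GPU-level demand matrix into a server-level matrix."""
    g = gpus_per_server
    servers = len(demand) // g
    agg = [[0] * servers for _ in range(servers)]
    for i, row in enumerate(demand):
        for j, units in enumerate(row):
            agg[i // g][j // g] += units
    return agg


def max_port_load(demand):
    """Largest row or column sum: a lower bound on the number of slots
    needed when each port sends and receives at most one unit per slot."""
    rows = [sum(r) for r in demand]
    cols = [sum(c) for c in zip(*demand)]
    return max(rows + cols)


# Hypothetical inter-server demand: GPUs 0-1 in server 0, GPUs 2-3 in server 1.
skewed = [
    [0, 0, 4, 0],  # all four units leave GPU 0 and arrive at GPU 2
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
balanced = [
    [0, 0, 1, 1],  # same four units spread over both senders and receivers
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

print(server_aggregate(skewed, 2))    # [[0, 4], [0, 0]]
print(server_aggregate(balanced, 2))  # [[0, 4], [0, 0]]
print(max_port_load(skewed))          # 4
print(max_port_load(balanced))        # 2
```

In this toy case the server-level matrix is identical before and after, while the per-port bottleneck drops from 4 to 2 units.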

Paper Structure (30 sections, 4 theorems, 80 equations, 3 figures)

Introduction
System Model
Two-Tier Fabric and Intra-Server Load Balancing
Server-Level Traffic Aggregation
Hierarchical Birkhoff--von Neumann Decomposition
Preliminaries
Permutation and subpermutation matrices
The Theory of Hierarchical Birkhoff--von Neumann Decomposition
Using Hierarchical Birkhoff-von Neumann Decomposition for the Two-Tier Fabric
Balancing the Traffic within a Server
The objective of traffic balancing
Local Unit-Transfer Operations
Column Transfer
Row Transfer
Two-Phase Balancing Algorithm
...and 15 more sections

Key Result

Proposition 2

Let $\mathbf{X}$ be an $n\times n$ nonnegative integer-valued matrix.
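The statement above is cut off in this listing, and the sketch below does not attempt to complete it. As an illustration of the classical Birkhoff--von Neumann result only (not the paper's hierarchical construction), the following Python code decomposes a nonnegative integer matrix whose row and column sums are all equal into permutation matrices with positive integer weights. The equal-sum condition, the function names, and the use of a simple augmenting-path matching are assumptions of this sketch.

```python
def perfect_matching(x):
    """Return perm with perm[i] = column matched to row i, using only
    positive entries of x (augmenting-path bipartite matching)."""
    n = len(x)
    col_owner = [-1] * n  # col_owner[j] = row currently matched to column j

    def try_row(i, seen):
        for j in range(n):
            if x[i][j] > 0 and not seen[j]:
                seen[j] = True
                if col_owner[j] == -1 or try_row(col_owner[j], seen):
                    col_owner[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, [False] * n):
            raise ValueError("no perfect matching on the positive entries")
    perm = [0] * n
    for j, i in enumerate(col_owner):
        perm[i] = j
    return perm


def bvn_decompose(matrix):
    """Return a list of (weight, perm) pairs whose weighted permutation
    matrices sum to `matrix`. Assumes all row and column sums are equal."""
    x = [row[:] for row in matrix]  # work on a copy
    terms = []
    while any(v > 0 for row in x for v in row):
        perm = perfect_matching(x)
        weight = min(x[i][perm[i]] for i in range(len(x)))
        for i, j in enumerate(perm):
            x[i][j] -= weight  # zeroes at least one entry per round
        terms.append((weight, perm))
    return terms


# Hypothetical 3x3 demand with every row and column summing to 3.
demand = [
    [1, 2, 0],
    [2, 0, 1],
    [0, 1, 2],
]
for weight, perm in bvn_decompose(demand):
    print(weight, perm)
```

Each round subtracts the same weight from every row and column, so the residual keeps equal line sums and a perfect matching on its positive entries always exists. The weights therefore add up to the common line sum, which is the number of slots a schedule built from these matchings would use.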

Figures (3)

Figure 1: Two-tier GPU communication fabric. GPUs within the same server exchange traffic via a high-bandwidth intra-server switch (bandwidth $B_1$ per port), while inter-server GPU traffic is carried by NICs through the global $mn\times mn$ crossbar with bandwidth $B_2$ per port.
Figure 2: Mean frame length $\mathbb{E}[T_f]$ versus $r_0$ with $(n,m)=(8,2)$ under (a) Model U and (b) Model NU. Curves compare DFS with hierarchical BvN decomposition with and without intra-server balancing (Sec. \ref{sec:balancing}). Each point is obtained from a fixed slot-horizon simulation with a warm-up period removed from statistics.
Figure 3: Mean frame length $\mathbb{E}[T_f]$ versus $r_0$ with $(n,m)=(8,2)$ under intra-server balancing (Sec. \ref{sec:balancing}). The plot overlays Model U and Model NU over the common sweep range of $r_0$, showing that the two curves nearly overlap.

Theorems & Definitions (6)

Definition 1: Scaled doubly stochastic matrices
Proposition 2: Birkhoff--von Neumann decomposition
Definition 3: $(m,n)$-block matrix
Theorem 4
Theorem 5
Theorem 6
