Dynamic Hierarchical Birkhoff-von Neumann Decomposition for All-to-All GPU Communication

Yen-Chieh Wu, Cheng-Shang Chang, Duan-Shin Lee, H. Jonathan Chao

TL;DR

This paper proposes a dynamic hierarchical Birkhoff--von Neumann (BvN) decomposition framework tailored to two-tier GPU fabrics. The resulting online scheduler is provably stable under admissible Poisson arrivals and, in simulation, substantially reduces mean frame length.

Abstract

All-to-all GPU communication is a critical bottleneck in large-scale training clusters, where completion time is constrained by per-port bandwidth and can be severely impacted by traffic skew across GPUs and network interface cards (NICs). This issue is amplified by the two-tier structure of modern GPU systems, which combine fast intra-server links with much slower inter-server networks. Motivated by recent system observations that highlight the importance of traffic reshaping and hierarchy awareness, we study all-to-all scheduling from an online switching and queueing-theoretic perspective. We propose a dynamic hierarchical Birkhoff--von Neumann (BvN) decomposition framework tailored to two-tier GPU fabrics. At each frame boundary, traffic is first balanced within each server using simple local operations to mitigate micro-level GPU/NIC skew while preserving aggregate server-to-server demand. A hierarchical BvN decomposition is then applied at the server level and refined into GPU-level matchings, significantly reducing decomposition complexity relative to a flat GPU-level approach. By integrating this construction with the dynamic frame sizing (DFS) principle, we obtain an online scheduler with provable stability under admissible Poisson arrivals. Simulations demonstrate substantial reductions in mean frame length, particularly under server-localized hotspot traffic.
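
The balancing step described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the even-split rule, the fractional entries, the convention that GPU `g` belongs to server `g // gpus_per_server`, and the names `server_demand`, `balance_blocks`, and `max_line_sum` are all assumptions made for illustration. The sketch only shows that redistributing traffic inside each server-to-server block leaves the aggregate server-level demand unchanged while lowering the largest row or column sum, which lower-bounds the number of slots needed to clear a frame when every port moves one unit per slot.

```python
import numpy as np

def server_demand(X, num_servers, gpus_per_server):
    """Aggregate a GPU-level traffic matrix into server-to-server demand."""
    s, g = num_servers, gpus_per_server
    # Entry [a, i, b, j] of the reshaped array is X[a*g + i, b*g + j].
    return X.reshape(s, g, s, g).sum(axis=(1, 3))

def balance_blocks(X, num_servers, gpus_per_server):
    """Assumed balancing rule: spread each block's total evenly over the block."""
    s, g = num_servers, gpus_per_server
    D = server_demand(X, s, g)
    return np.kron(D, np.ones((g, g))) / (g * g)

def max_line_sum(X):
    """Largest row or column sum of X."""
    return max(X.sum(axis=1).max(), X.sum(axis=0).max())

# Hypothetical hotspot: GPU 0 (server 0) sends 8 units to GPU 2 (server 1).
X = np.zeros((4, 4), dtype=int)
X[0, 2] = 8
Xb = balance_blocks(X, num_servers=2, gpus_per_server=2)

print(max_line_sum(X))   # 8
print(max_line_sum(Xb))  # 4.0
print(np.array_equal(server_demand(X, 2, 2), server_demand(Xb, 2, 2)))  # True
```

In this toy case the server-to-server demand stays at 8 units from server 0 to server 1, but the load is shared by both GPUs on each side, so the largest line sum halves.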

Paper Structure

This paper contains 30 sections, 4 theorems, 80 equations, and 3 figures.

Key Result

Proposition 2

Let $\mathbf{X}$ be an $n\times n$ nonnegative integer-valued matrix.
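
The statement of Proposition 2 is cut off in this listing, so the following is only a sketch of the classical Birkhoff--von Neumann result, under the assumption that every row and column of $\mathbf{X}$ sums to the same integer; the paper's version may be stated more generally. In that case $\mathbf{X}$ can be written as a weighted sum of permutation matrices whose weights add up to the common line sum. The function names `perfect_matching` and `bvn_decompose` are hypothetical, and the matching routine is a plain augmenting-path search rather than anything taken from the paper.

```python
def perfect_matching(X):
    """Find a perfect matching on the positive entries of X (augmenting paths).

    Returns match, where match[j] is the row assigned to column j.
    """
    n = len(X)
    match = [-1] * n

    def try_row(i, seen):
        for j in range(n):
            if X[i][j] > 0 and not seen[j]:
                seen[j] = True
                if match[j] == -1 or try_row(match[j], seen):
                    match[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, [False] * n):
            raise ValueError("no perfect matching on the support of X")
    return match

def bvn_decompose(X):
    """Return a list of (weight, perm) pairs, where perm[i] is row i's column."""
    X = [list(row) for row in X]  # work on a copy
    n = len(X)
    line = sum(X[0])
    col_sums = [sum(X[i][j] for i in range(n)) for j in range(n)]
    if any(sum(row) != line for row in X) or any(c != line for c in col_sums):
        raise ValueError("all row and column sums must be equal")
    terms = []
    while any(v > 0 for row in X for v in row):
        match = perfect_matching(X)
        perm = [0] * n
        for j, i in enumerate(match):
            perm[i] = j
        # Subtract the largest multiple of this permutation that keeps X nonnegative.
        w = min(X[i][perm[i]] for i in range(n))
        for i in range(n):
            X[i][perm[i]] -= w
        terms.append((w, perm))
    return terms

X = [[2, 1, 0],
     [0, 2, 1],
     [1, 0, 2]]
for weight, perm in bvn_decompose(X):
    print(weight, perm)
```

Each pass removes at least one positive entry and keeps all line sums equal, so a perfect matching exists at every step and the loop terminates. Read as a schedule, each `(weight, perm)` pair is one crossbar configuration held for `weight` slots.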

Figures (3)

  • Figure 1: Two-tier GPU communication fabric. GPUs within the same server exchange traffic via a high-bandwidth intra-server switch (bandwidth $B_1$ per port), while inter-server GPU traffic is carried by NICs through the global $mn\times mn$ crossbar with bandwidth $B_2$ per port.
  • Figure 2: Mean frame length $\mathbb{E}[T_f]$ versus $r_0$ with $(n,m)=(8,2)$ under (a) Model U and (b) Model NU. Curves compare DFS with hierarchical BvN decomposition, with and without intra-server balancing (see the section on intra-server balancing). Each point is obtained from a fixed slot-horizon simulation, with the warm-up period excluded from the statistics.
  • Figure 3: Mean frame length $\mathbb{E}[T_f]$ versus $r_0$ with $(n,m)=(8,2)$ under intra-server balancing (see the section on intra-server balancing). The plot overlays Model U and Model NU over the common sweep range of $r_0$, showing that the two curves nearly overlap.

Theorems & Definitions (6)

  • Definition 1: Scaled doubly stochastic matrices
  • Proposition 2: Birkhoff--von Neumann decomposition
  • Definition 3: $(m,n)$-block matrix
  • Theorem 4
  • Theorem 5
  • Theorem 6