Leaf-centric Logical Topology Design for OCS-based GPU Clusters

Xinchi Han, Weihao Jiang, Yingming Mao, Yike Liu, Zhuoran Liu, Yongxi Lv, Peirui Cao, Zhuotao Liu, Ximeng Liu, Xinbing Wang, Changbo Wu, Zihan Zhu, Wu Dongchao, Yang Jian, Zhang Zhanbang, Yuansen Chen, Shizhen Zhao

Abstract

Recent years have witnessed the growing deployment of optical circuit switches (OCSes) in commercial GPU clusters (e.g., the Google A3 GPU cluster) optimized for machine learning (ML) workloads. Such clusters adopt a three-tier leaf-spine-OCS topology: servers attach to leaf-layer electronic packet switches (EPSes); these leaf switches aggregate into spine-layer EPSes to form a Pod; and multiple Pods are interconnected via core-layer OCSes. Unlike EPSes, OCSes only support circuit-based paths between directly connected spine switches, potentially inducing a phenomenon termed routing polarization, in which the bandwidth requirements between specific pairs of Pods are unevenly fulfilled through links among different spine switches. The resulting imbalance induces traffic contention and bottlenecks on specific leaf-to-spine links, ultimately reducing ML training throughput. To mitigate this issue, we introduce a leaf-centric paradigm that ensures traffic originating from the same leaf switch is evenly distributed across multiple spine switches with balanced loads. Through rigorous theoretical analysis, we establish a sufficient condition for avoiding routing polarization and propose a corresponding logical topology design algorithm with polynomial-time complexity. Large-scale simulations validate up to a 19.27% throughput improvement and a 99.16% reduction in logical topology computation overhead compared to Mixed Integer Programming (MIP)-based methods.
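The core idea of the leaf-centric paradigm, spreading each leaf switch's inter-Pod demand evenly across its spine switches, can be illustrated with a minimal round-robin sketch. This is a simplified illustration only, not the paper's polynomial-time algorithm; the function name and input format are hypothetical.

```python
from collections import defaultdict

def leaf_centric_assignment(leaf_demands, num_spines):
    """Illustrative sketch (not the paper's algorithm): spread each leaf's
    inter-Pod demand units across spine switches round-robin, so no single
    leaf-to-spine link carries a disproportionate share of the load.

    leaf_demands: dict mapping a leaf switch name to its number of demand units.
    num_spines:   number of spine switches in the Pod.
    Returns a dict mapping (leaf, spine_index) to assigned demand units.
    """
    load = defaultdict(int)
    for leaf, units in leaf_demands.items():
        for i in range(units):
            load[(leaf, i % num_spines)] += 1
    return dict(load)

# 4 demand units from leaf L0 over 2 spines -> 2 units on each spine
assert leaf_centric_assignment({"L0": 4}, 2) == {("L0", 0): 2, ("L0", 1): 2}
```

A polarized (Pod-centric) assignment would instead concentrate all 4 units on one spine, creating exactly the leaf-to-spine bottleneck the abstract describes.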

Paper Structure

This paper contains 23 sections, 5 theorems, 31 equations, 6 figures, and 1 algorithm.

Key Result

Theorem 2.1

The problem of designing a Leaf-centric logical topology is NP-complete for intra-Pod physical topologies with $\tau=1$.

Figures (6)

  • Figure 1: LumosCore adopts a typical three-tier topology comprising a leaf layer, a spine layer, and an optical core layer.
  • Figure 2: Pod-centric logical topology may result in the routing polarization issue.
  • Figure 3: An illustrative example demonstrating that a poorly designed intra-Pod physical topology can give rise to inherently unavoidable routing polarization. Notably, the logical topology should meet the L2 compatibility constraint [poutievski2022jupiter, 108922020On2021Gemini, han2024lumoscore], which means that if the ingress port of an optical module $A$ on a spine switch is connected to the egress port of an optical module $B$ on another spine switch, then the egress port of $A$ must also be connected to the ingress port of $B$.
  • Figure 4: The comparative performance of the evaluated strategies highlights the critical role of leaf-centric logical topology design and intra-Pod physical topology design. Unless otherwise specified, the workload level is fixed at 0.767, and the cluster configuration comprises 8,192 GPUs by default.
  • Figure 5: Comparative analysis of average time overhead in logical topology design
  • ...and 1 more figure
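The L2 compatibility constraint from Figure 3's caption is a symmetry requirement on module-level connectivity: whenever module $B$'s egress feeds module $A$'s ingress, module $A$'s egress must feed module $B$'s ingress. A minimal sketch of such a check, with a hypothetical representation of the logical topology as directed (egress module, ingress module) pairs:

```python
def is_l2_compatible(links):
    """Check the L2 compatibility (symmetry) constraint on a logical topology.

    links: a set of directed pairs (a, b), meaning the egress port of optical
    module `a` is connected to the ingress port of optical module `b`.
    The topology is L2-compatible iff every directed link has its reverse,
    i.e. the module-level connectivity is symmetric.
    """
    return all((b, a) in links for (a, b) in links)

# symmetric pairing between modules A and B: compatible
assert is_l2_compatible({("A", "B"), ("B", "A")})
# one-way circuit A -> B -> C -> A: violates the constraint
assert not is_l2_compatible({("A", "B"), ("B", "C"), ("C", "A")})
```

This symmetry is what allows a circuit through the OCS layer to carry traffic in both directions between the same pair of spine switches.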

Theorems & Definitions (7)

  • Theorem 2.1
  • Proof 1
  • Theorem 2.2
  • Theorem 2.3
  • Theorem 3.1
  • Proof 2
  • Theorem 3.2