Scheduling Parallel Optical Circuit Switches for AI Training

Kevin Liang, Litao Qiao, Isaac Keslassy, Bill Lin

TL;DR

Evaluated on realistic AI training workloads (GPT model and Qwen MoE expert routing) as well as standard benchmarks, Spectra vastly outperforms a baseline based on state-of-the-art algorithms, reducing schedule makespan by an average factor of $1.4\times$ on GPT AI workloads, $1.9\times$ on MoE AI workloads, and $2.4\times$ on standard benchmarks.

Abstract

The rapid growth of AI training has dramatically increased datacenter traffic demand and energy consumption, which has motivated renewed interest in optical circuit switches (OCSes) as a high-bandwidth, energy-efficient alternative for AI fabrics. Deploying multiple parallel OCSes is a leading alternative. However, efficiently scheduling time-varying traffic matrices across parallel optical switches with non-negligible reconfiguration delays remains an open challenge. We consider the problem of scheduling a single AI traffic demand matrix $D$ over $s$ parallel OCSes while minimizing the makespan under reconfiguration delay $\delta$. Our algorithm Spectra relies on a three-step approach: Decompose $D$ into a minimal set of weighted permutations; Schedule these permutations across parallel switches using load-aware assignment; then Equalize the imbalanced loads on the switches via controlled permutation splitting. Evaluated on realistic AI training workloads (GPT model and Qwen MoE expert routing) as well as standard benchmarks, Spectra vastly outperforms a baseline based on state-of-the-art algorithms, reducing schedule makespan by an average factor of $1.4\times$ on GPT AI workloads, $1.9\times$ on MoE AI workloads, and $2.4\times$ on standard benchmarks. Further, the makespans achieved by Spectra consistently approach newly derived lower bounds.
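
The Schedule and Equalize steps named in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes the Decompose step has already produced a list of permutation weights (durations), that each permutation placed on a switch costs its weight plus one reconfiguration delay $\delta$, and it uses a longest-first greedy rule as a stand-in for the paper's load-aware assignment. The function names `schedule_lpt` and `equalize_once` and the example weights are hypothetical.

```python
import heapq

def schedule_lpt(weights, s, delta):
    """Greedy stand-in for 'load-aware assignment' (an assumption, not Spectra):
    place permutations longest-first on the currently least-loaded switch.
    Each placement costs the permutation's weight plus one delay delta."""
    loads = [0.0] * s
    assignment = [[] for _ in range(s)]      # per switch: [perm_id, duration]
    heap = [(0.0, j) for j in range(s)]
    heapq.heapify(heap)
    for i in sorted(range(len(weights)), key=lambda i: -weights[i]):
        load, j = heapq.heappop(heap)
        assignment[j].append([i, float(weights[i])])
        loads[j] = load + weights[i] + delta
        heapq.heappush(heap, (loads[j], j))
    return assignment, loads

def equalize_once(assignment, loads, delta):
    """One illustrative splitting step: move part of the largest permutation
    on the most-loaded switch to the least-loaded switch. The receiving
    switch pays an extra delta for the new configuration."""
    hi = max(range(len(loads)), key=loads.__getitem__)
    lo = min(range(len(loads)), key=loads.__getitem__)
    gap = loads[hi] - loads[lo]
    if gap <= delta or not assignment[hi]:
        return False                         # a split cannot reduce the makespan
    x = (gap - delta) / 2                    # amount that equalizes the two loads
    entry = max(assignment[hi], key=lambda e: e[1])
    if entry[1] <= x:
        return False                         # sketch only handles a genuine split
    entry[1] -= x
    assignment[lo].append([entry[0], x])
    loads[hi] -= x
    loads[lo] += delta + x
    return True

# Hypothetical example: k = 3 permutations, s = 2 switches, delta = 1.
weights = [8, 3, 2]
assignment, loads = schedule_lpt(weights, s=2, delta=1)
before = max(loads)
equalize_once(assignment, loads, delta=1)
after = max(loads)
```

In this toy instance the greedy assignment leaves the two switches with unequal loads, and a single split of the heaviest permutation brings both switches to the same finishing time. Splitting only pays off when the load gap exceeds $\delta$, since the receiving switch must reconfigure once more.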

Paper Structure (13 sections, 5 theorems, 10 equations, 11 figures, 4 algorithms)

Key Result

Theorem 1

Assume that row/column $i$ has $k_i$ nonzero elements and a total weight $w_i$. Then the scheduling makespan has lower bound

Figures (11)

  • Figure 1: Datacenter network topology with $s$ parallel OCSes.
  • Figure 2: Example of demand matrix $D$.
  • Figure 3: Example decomposition into $k = 3$ weighted permutations that cover $D$.
  • Figure 4: (a) Scheduling the $k = 3$ permutations across $s = 2$ switches. (b) Equalizing the loads among the switches.
  • Figure 5: MoE traffic matrix heatmap from 64 sender GPUs (rows) to 64 receiver GPUs (columns).
  • ...and 6 more figures

Theorems & Definitions (5)

  • Theorem 1: Lower bound 1
  • Lemma 1
  • Theorem 2: Lower bound 2
  • Proposition 1
  • Proposition 2