Photonic Rails in ML Datacenters with Opus

Eric Ding, Barry Lyu, Bhaskar Kataria, Rachee Singh

TL;DR

This work rethinks the rail abstraction, retaining its communication semantics but realizing it with optical circuit switches, and designs and implements Opus, a control plane that orchestrates in-job reconfiguration of photonic rails at parallelism phase boundaries.

Abstract

Rail-optimized network fabrics have become the de facto datacenter scale-out fabric for large-scale ML training. However, using high-radix electrical switches to provide all-to-all connectivity within rails incurs massive power and cost overheads. We propose rethinking the rail abstraction: retain its communication semantics, but realize it with optical circuit switches (OCSes). The key challenge is that an optical switch supports only one-to-one connectivity at any given time, which limits the fan-out of traffic in ML workloads that use hybrid parallelism. We overcome this through parallelism-driven rail reconfiguration, which exploits the non-overlapping communication phases of different parallelism dimensions. This time-multiplexes a single set of physical ports across circuit configurations, each tailored to one phase within a training iteration. We design and implement Opus, a control plane that orchestrates this in-job reconfiguration of photonic rails at parallelism phase boundaries, and evaluate it on a physical OCS testbed, on the Perlmutter supercomputer, and in simulation at up to 2,048 GPUs. Our results show that photonic rails can achieve over $23\times$ network power reduction and $4\times$ cost savings while incurring less than $6\%$ training overhead at production-relevant OCS reconfiguration latencies.
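The one-to-one constraint and the time-multiplexing idea described in the abstract can be illustrated with a small model. The sketch below is not Opus's implementation: the directed-circuit model, the 4-port rail, the PP=2/DP=2 layout, and all names (`is_valid_ocs_config`, `ring`, `PHASE_CONFIGS`, `schedule`) are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: a circuit is modeled as a directed (tx_port, rx_port) pair.
Circuit = Tuple[int, int]


def is_valid_ocs_config(circuits: List[Circuit]) -> bool:
    """True if each port transmits to at most one peer and receives from at most one."""
    tx = [src for src, _ in circuits]
    rx = [dst for _, dst in circuits]
    return len(tx) == len(set(tx)) and len(rx) == len(set(rx))


def ring(ports: List[int]) -> List[Circuit]:
    """Directed ring over a group of ports, e.g. for a ring-based collective."""
    return [(ports[i], ports[(i + 1) % len(ports)]) for i in range(len(ports))]


# Hypothetical rail with 4 GPU ports, PP=2 and DP=2 (illustrative layout only).
# DP groups are {0, 1} and {2, 3}; PP neighbors are 0 <-> 2 and 1 <-> 3.
PHASE_CONFIGS: Dict[str, List[Circuit]] = {
    "dp": ring([0, 1]) + ring([2, 3]),
    "pp": [(0, 2), (2, 0), (1, 3), (3, 1)],
}


def schedule(phases: List[str]) -> List[Tuple[str, bool]]:
    """Walk the phases of one iteration; flag where the circuits must change."""
    current: Optional[str] = None
    steps = []
    for phase in phases:
        steps.append((phase, phase != current))
        current = phase
    return steps


if __name__ == "__main__":
    # Each phase alone fits the one-to-one constraint ...
    print(is_valid_ocs_config(PHASE_CONFIGS["dp"]))
    print(is_valid_ocs_config(PHASE_CONFIGS["pp"]))
    # ... but both phases at once would need fan-out from every port.
    print(is_valid_ocs_config(PHASE_CONFIGS["dp"] + PHASE_CONFIGS["pp"]))
    print(schedule(["pp", "pp", "dp", "pp"]))
```

In this toy model, each phase's circuit set is a valid one-to-one configuration by itself, while their union is not. That is why the same physical ports have to be reconfigured between phases rather than serving both at once.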

Paper Structure (18 sections, 2 equations, 14 figures, 4 tables, 2 algorithms)

Figures (14)

  • Figure 1: Rail-optimized fabrics. We propose to replace packet switches (shown as Rail 0, Rail 1, etc.) with optical circuit switches. We make the case for using in-job reconfiguration to retain the illusion of full connectivity between GPU ranks connected to the same optical rail switch.
  • Figure 2: Traffic in a training iteration with 3D parallelism.
  • Figure 3: Communication pattern for PP and FSDP in one iteration, split based on the warm-up, steady, and cool-down stages of the pipeline (4 rails in total, only showing rail 0, TP is hidden). (a) PP=2, FSDP=2. (b) PP=3, FSDP=2.
  • Figure 4: (a) CDF of window size from 10 iterations in Exp 1. (b) Rail 0 window breakdown based on traffic volume after the window and before the next window, in one iteration of Exp 1. <1MB: AllReduce synchronization calls, 64MB: PP Send/Recv, 957MB: DP AllGather, 3829MB: DP ReduceScatter. (c) Rail 0 window size box plot for the step latency from three experiments.
  • Figure 5: Number of windows in one training iteration with different parallelisms. $n_{\text{layer}}$: number of model layers, $n_{\text{microbatch}}$: number of microbatches per global batch, $PP$: pipeline parallel degree.
  • ...and 9 more figures