Photonic Rails in ML Datacenters with Opus

Eric Ding; Barry Lyu; Bhaskar Kataria; Rachee Singh

TL;DR

This work rethinks the rail abstraction: it retains the rail's communication semantics but realizes them with optical circuit switches. It also designs and implements Opus, a control plane that orchestrates in-job reconfiguration of the resulting photonic rails at parallelism phase boundaries.

Abstract

Rail-optimized network fabrics have become the de facto datacenter scale-out fabric for large-scale ML training. However, the use of high-radix electrical switches to provide all-to-all connectivity in rails imposes massive power and cost overheads. We propose rethinking the rail abstraction: retaining its communication semantics while realizing it with optical circuit switches (OCSes). The key challenge is that optical switches support only one-to-one connectivity at any given time, which limits the fan-out of traffic in ML workloads that use hybrid parallelism. We overcome this through parallelism-driven rail reconfiguration, which exploits the non-overlapping communication phases of different parallelism dimensions. This approach time-multiplexes a single set of physical ports across circuit configurations tailored to each phase within a training iteration. We design and implement Opus, a control plane that orchestrates this in-job reconfiguration of photonic rails at parallelism phase boundaries, and evaluate it on a physical OCS testbed, on the Perlmutter supercomputer, and in simulation at up to 2,048 GPUs. Our results show that photonic rails can achieve over $23\times$ network power reduction and $4\times$ cost savings while incurring less than $6\%$ training overhead at production-relevant OCS reconfiguration latencies.
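The one-to-one constraint and the time-multiplexing idea described in the abstract can be illustrated with a toy model. The sketch below is not the Opus implementation or its API. It assumes a hypothetical rail with four GPU ranks, one OCS port per rank, and a job with PP=2 and DP=2. All names (`is_valid_ocs_config`, `PHASE_CONFIGS`, `count_reconfigurations`), the rank layout, and the phase order are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# A circuit connects exactly two ports (here, one port per GPU rank on a rail).
Circuit = Tuple[int, int]


def is_valid_ocs_config(circuits: List[Circuit]) -> bool:
    """Return True if every port appears in at most one circuit.

    This models the one-to-one connectivity constraint of an optical
    circuit switch: a port cannot fan out to two peers at the same time.
    """
    used = set()
    for a, b in circuits:
        if a == b or a in used or b in used:
            return False
        used.update((a, b))
    return True


# Hypothetical layout: ranks 0,1 form pipeline stage 0; ranks 2,3 form stage 1.
# DP peers share a stage; PP peers sit in adjacent stages.
PHASE_CONFIGS: Dict[str, List[Circuit]] = {
    "DP": [(0, 1), (2, 3)],
    "PP": [(0, 2), (1, 3)],
}


def count_reconfigurations(phases: List[str]) -> int:
    """Count phase boundaries at which the circuit configuration changes."""
    return sum(
        1
        for prev, cur in zip(phases, phases[1:])
        if PHASE_CONFIGS[prev] != PHASE_CONFIGS[cur]
    )


if __name__ == "__main__":
    # Each phase alone fits on the OCS.
    print(is_valid_ocs_config(PHASE_CONFIGS["DP"]))
    print(is_valid_ocs_config(PHASE_CONFIGS["PP"]))
    # Serving both phases at once would need fan-out of 2 per port.
    print(is_valid_ocs_config(PHASE_CONFIGS["DP"] + PHASE_CONFIGS["PP"]))
    # An assumed phase order within one iteration, for illustration only.
    print(count_reconfigurations(["DP", "PP", "PP", "DP"]))
```

In this toy setting, each phase's circuits form a valid configuration by themselves, while their union does not. This is why the same ports have to be reconfigured at phase boundaries rather than wired once for all parallelism dimensions.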

Paper Structure (18 sections, 2 equations, 14 figures, 4 tables, 2 algorithms)

Introduction
Our Proposal: Electrical Rails $\rightarrow$ Optical Rails
Optical Rails: Challenges and Opportunities
Challenges
Opportunities
Parallelism-driven Rail Reconfiguration
Opus System Architecture
Reconfiguration Protocol
Implementation
Evaluating Opus
Lab Hardware Evaluation
Emulation of Opus on a Supercomputer
Large-scale Simulation of Opus
Related Work
Discussion
...and 3 more sections

Figures (14)

Figure 1: Rail-optimized fabrics. We propose to replace packet switches (shown as Rail 0, Rail 1 etc.) with optical circuit switches. We make the case for retaining the illusion of full connectivity between GPU ranks connected to the same optical rail switch using in-job reconfiguration.
Figure 2: Traffic in a training iteration with 3D parallelism.
Figure 3: Communication pattern for PP and FSDP in one iteration, split based on the warm-up, steady, and cool-down stages of the pipeline (4 rails in total, only showing rail 0, TP is hidden). (a) PP=2, FSDP=2. (b) PP=3, FSDP=2.
Figure 4: (a) CDF of window size from 10 iterations in Exp 1. (b) Rail 0 window breakdown based on traffic volume after the window and before the next window, in one iteration of Exp 1. <1MB: AllReduce synchronization calls, 64MB: PP Send/Recv, 957MB: DP AllGather, 3829MB: DP ReduceScatter. (c) Rail 0 window size box plot for the step latency from three experiments.
Figure 5: Number of windows in one training iteration with different parallelisms. $n_{\text{layer}}$: number of model layers, $n_{\text{microbatch}}$: number of microbatches per global batch, $PP$: pipeline parallel degree.
...and 9 more figures
