ACOS: Arrays of Cheap Optical Switches

Daniel Amir; Ori Cohen; Jakob Krebs; Mark Silberstein

TL;DR

Simulation shows that ACOS-based deployments match the performance of fully provisioned packet-switched networks when training state-of-the-art LLMs at scale, while delivering significant cost savings using existing off-the-shelf OCSes, with strong bandwidth scaling and higher cost savings in the future.

Abstract

Machine learning training places immense demands on cluster networks, motivating specialized architectures and co-design with parallelization strategies. Recent designs incorporating optical circuit switches (OCSes) are promising, offering better cost, power efficiency, and long-term bandwidth scaling than packet switches. However, most existing approaches rely on costly high-radix OCSes and/or combine them with packet switches to achieve competitive performance at scale. High-radix OCSes are both expensive and slow to reconfigure, limiting both scalability and performance. We propose Arrays of Cheap Optical Switches (ACOS), which bring application co-design directly to the structure of the reconfigurable fabric. Using low-radix OCSes as building blocks, ACOS supports the forms of reconfiguration needed in training clusters, including topology selection, workload adaptation, and failure resilience. The cost of ACOS scales with the supported topologies and adaptations rather than with port count, breaking past the scalability barriers of current specialized ML networks. We show through simulation that ACOS-based deployments match the performance of fully provisioned packet-switched networks when training state-of-the-art LLMs at scale, while delivering significant cost savings using existing off-the-shelf OCSes, with strong bandwidth scaling and higher cost savings in the future.

Paper Structure (71 sections, 12 figures, 9 tables)

Introduction
Motivation
Limitations of existing solutions
Opportunity: unique network requirements of ML training
Non-overlapping collectives
Structured and repetitive communication patterns
Executing collectives over low-degree physical topologies
Intra-training OCS reconfiguration
Summary
Commodity hardware support
Inexpensive low-radix OCSes
Multi-port support in modern NICs
Summary
Design considerations
Unified vs. multiple fabrics
...and 56 more sections

Figures (12)

Figure 1: Main building blocks for ACOS topologies
Figure 2: (A) Merging resilient rings requires three sets of 2$\times$2 switches. (B) All 2$\times$2 switches on regular links are replicated on offsetting links.
Figure 3: 16-GPU cluster with two orthogonal topologies for 2D parallelism. Horizontal (A) -- rings of sizes 4 and 8 GPUs, vertical (B) -- rings of size 4 or 2 GPUs.
Figure 4: Two configurations for a non-resilient rack-scale cluster: (A-B, E) TP=4, DP=4, PP=4, EP=16, and (C-D, F) TP=8, DP=4, PP=2, EP=8. Each row shows the same GPUs, with the TP (left) and DP+PP (right) topologies. Color indicates which nodes are connected by the topologies shown on the opposite side. The EP topology is simplified in this figure.
Figure 5: Modifications needed to support node-level resiliency in the rack-scale cluster. (A) shows 9 nodes, each of which contains 8 GPUs. One node may fail in each rack. (B) shows a detailed representation of the resilient ring topology used for tensor parallelism. (C) depicts offsetting links used for cross-rack PP links.
...and 7 more figures
