ACOS: Arrays of Cheap Optical Switches

Daniel Amir, Ori Cohen, Jakob Krebs, Mark Silberstein

TL;DR

Simulation shows that ACOS-based deployments match the performance of fully provisioned packet-switched networks when training state-of-the-art LLMs at scale, while delivering significant cost savings using existing off-the-shelf OCSes, with strong bandwidth scaling and higher cost savings in the future.

Abstract

Machine learning training places immense demands on cluster networks, motivating specialized architectures and co-design with parallelization strategies. Recent designs incorporating optical circuit switches (OCSes) are promising, offering lower cost, better power efficiency, and better long-term bandwidth scaling than packet switches. However, most existing approaches rely on costly high-radix OCSes and/or combine them with packet switches to achieve competitive performance at scale. Unfortunately, high-radix OCSes are both expensive and slow to reconfigure, limiting both scalability and performance. We propose Arrays of Cheap Optical Switches (ACOS), which bring application co-design directly to the structure of the reconfigurable fabric. Using low-radix OCSes as building blocks, ACOS supports the forms of reconfiguration needed in training clusters, including topology selection, workload adaptation, and failure resilience. The cost of ACOS scales with the supported topologies and adaptations rather than with port count, breaking past the scalability barriers of current specialized ML networks. We show through simulation that ACOS-based deployments match the performance of fully provisioned packet-switched networks when training state-of-the-art LLMs at scale, while delivering significant cost savings using existing off-the-shelf OCSes, with strong bandwidth scaling and higher cost savings in the future.

Paper Structure (71 sections, 12 figures, 9 tables)

Figures (12)

  • Figure 1: Main building blocks for ACOS topologies
  • Figure 2: (A) Merging resilient rings requires three sets of 2$\times$2 switches. (B) All 2$\times$2 switches on regular links are replicated on offsetting links.
  • Figure 3: 16-GPU cluster with two orthogonal topologies for 2D parallelism. Horizontal (A) -- rings of sizes 4 and 8 GPUs, vertical (B) -- rings of size 4 or 2 GPUs.
  • Figure 4: Two configurations for a non-resilient rack-scale cluster: (A-B, E) TP=4, DP=4, PP=4, EP=16, and (C-D, F) TP=8, DP=4, PP=2, EP=8. Each row shows the same GPUs, with the TP (left) and DP+PP (right) topologies. Color indicates which nodes are connected by the topologies shown on the opposite side. The EP topology is simplified in this figure.
  • Figure 5: Modifications needed to support node-level resiliency in the rack-scale cluster. (A) shows 9 nodes, each of which contains 8 GPUs. One node may fail in each rack. (B) shows a detailed representation of the resilient ring topology used for tensor parallelism. (C) depicts offsetting links used for cross-rack PP links.
  • ...and 7 more figures