Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, Torsten Hoefler

TL;DR

This work uses multicast to build a constant-time reliable Broadcast protocol, which serves as the building block for an optimal Allgather schedule. It then extracts the parallelism in the Allgather algorithm and maps it to a SmartNIC specialized for hiding the cost of data movement.

Abstract

In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2x traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.
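
To make the bandwidth argument concrete, the sketch below counts per-node NIC traffic for a classical ring Allgather and for an Allgather composed of multicast Broadcasts. It is a simplified accounting written for illustration, not the paper's cost model: the function names are hypothetical, the multicast case assumes that switches replicate each shard so a node injects its shard only once, and control traffic for the reliability protocol is ignored.

```python
def ring_allgather_bytes(num_nodes: int, shard_bytes: int) -> dict:
    """Per-node NIC traffic for a point-to-point ring Allgather.

    Each node sends its own shard, then forwards the shards it receives,
    for a total of (num_nodes - 1) sends and (num_nodes - 1) receives.
    """
    sent = (num_nodes - 1) * shard_bytes
    received = (num_nodes - 1) * shard_bytes
    return {"sent": sent, "received": received}


def multicast_allgather_bytes(num_nodes: int, shard_bytes: int) -> dict:
    """Per-node NIC traffic when every shard is broadcast via multicast.

    Assumption: the network replicates packets, so each node injects its
    shard exactly once and still receives the other (num_nodes - 1) shards.
    """
    sent = shard_bytes
    received = (num_nodes - 1) * shard_bytes
    return {"sent": sent, "received": received}


def nic_traffic_ratio(num_nodes: int, shard_bytes: int) -> float:
    """Ratio of total (sent + received) NIC bytes: ring over multicast."""
    ring = ring_allgather_bytes(num_nodes, shard_bytes)
    mcast = multicast_allgather_bytes(num_nodes, shard_bytes)
    ring_total = ring["sent"] + ring["received"]
    mcast_total = mcast["sent"] + mcast["received"]
    return ring_total / mcast_total


if __name__ == "__main__":
    # 188 nodes matches the testbed size quoted in the abstract;
    # the 1 MiB shard size is an arbitrary illustrative value.
    print(nic_traffic_ratio(num_nodes=188, shard_bytes=1 << 20))
```

Under these assumptions the send-side traffic drops from P - 1 shards to a single shard per node, and the combined send and receive traffic approaches a factor of two as the node count P grows. Whether this matches how the reported 2x reduction was measured is not stated in the abstract.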

Paper Structure

This paper contains 45 sections, 5 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: A simplified overview of the bandwidth-optimal Allgather algorithm represented as a composition of multicast-based Broadcasts. Multicast traffic processing is handled by the Datapath Accelerator. In the example above, the traffic is evenly distributed across two parallel multicast trees. We accommodate the discrepancy between data movement work on the send and receive paths by assigning one send and two receive path workers.
  • Figure 2: Theoretical cost model of the bandwidth savings that can be achieved with the multicast-based Allgather algorithm compared to classical point-to-point approaches. The modeled system is a 1024-node cluster connected in a Fat-Tree topology using radix-32 switches.
  • Figure 3: Data movement at the training node boundary.
  • Figure 4: Trade-offs exist between the InfiniBand (IB) Verbs transport layer semantics and Allgather algorithm design. We present a practical multicast-based solution for Unreliable Datagram (UD) and Unreliable Connected (UC) transports.
  • Figure 5: A single-threaded datagram-based datapath running on a server-grade CPU is unable to sustain the 200 Gbit/s link bandwidth, while the datapath offloaded to a single multi-threaded DPA core scales to the peak throughput.
  • ...and 11 more figures