Table of Contents
Fetching ...

Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies

Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury

TL;DR

The paper addresses the critical problem of bandwidth-optimal all-to-all scheduling on large-scale direct-connect networks used in ML and HPC. It advances a scalable solution by formulating all-to-all as a Max Concurrent Multi-Commodity Flow problem and introducing a master LP with parallel child LPs, plus time-stepped and path-based MCF variants to handle different fabrics. A full compiler/toolchain lowers the optimized flows to practical runtimes (MSCCL/oneCCL) and NIC routing, with demonstrated near-optimal throughput across diverse topologies and scales, including GenKautz expanders that closely meet the theoretical lower bound on all-to-all time. The results show substantial performance gains over baselines, significant runtime efficiency at large N (up to 1000+), and clear guidance for topology choices and scheduling in reconfigurable direct-connect interconnects. This work enables scalable, high-bandwidth all-to-all communication for ML embeddings, FFTs, and MoE workloads on modern accelerator clusters.

Abstract

The all-to-all collective communications primitive is widely used in machine learning (ML) and high performance computing (HPC) workloads, and optimizing its performance is of interest to both ML and HPC communities. All-to-all is a particularly challenging workload that can severely strain the underlying interconnect bandwidth at scale. This paper takes a holistic approach to optimize the performance of all-to-all collective communications on supercomputer-scale direct-connect interconnects. We address several algorithmic and practical challenges in developing efficient and bandwidth-optimal all-to-all schedules for any topology and lowering the schedules to various runtimes and interconnect technologies. We also propose a novel topology that delivers near-optimal all-to-all performance.

Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies

TL;DR

The paper addresses the critical problem of bandwidth-optimal all-to-all scheduling on large-scale direct-connect networks used in ML and HPC. It advances a scalable solution by formulating all-to-all as a Max Concurrent Multi-Commodity Flow problem and introducing a master LP with parallel child LPs, plus time-stepped and path-based MCF variants to handle different fabrics. A full compiler/toolchain lowers the optimized flows to practical runtimes (MSCCL/oneCCL) and NIC routing, with demonstrated near-optimal throughput across diverse topologies and scales, including GenKautz expanders that closely meet the theoretical lower bound on all-to-all time. The results show substantial performance gains over baselines, significant runtime efficiency at large N (up to 1000+), and clear guidance for topology choices and scheduling in reconfigurable direct-connect interconnects. This work enables scalable, high-bandwidth all-to-all communication for ML embeddings, FFTs, and MoE workloads on modern accelerator clusters.

Abstract

The all-to-all collective communications primitive is widely used in machine learning (ML) and high performance computing (HPC) workloads, and optimizing its performance is of interest to both ML and HPC communities. All-to-all is a particularly challenging workload that can severely strain the underlying interconnect bandwidth at scale. This paper takes a holistic approach to optimize the performance of all-to-all collective communications on supercomputer-scale direct-connect interconnects. We address several algorithmic and practical challenges in developing efficient and bandwidth-optimal all-to-all schedules for any topology and lowering the schedules to various runtimes and interconnect technologies. We also propose a novel topology that delivers near-optimal all-to-all performance.
Paper Structure (22 sections, 1 theorem, 5 equations, 10 figures, 1 table)

This paper contains 22 sections, 1 theorem, 5 equations, 10 figures, 1 table.

Key Result

Theorem 1

The time taken to accomplish all-to-all communication in a $d$-regular graph $G$ on $N$ nodes scales as $\Omega(N\log_d N)$.

Figures (10)

  • Figure 1: Generating link- and path-based schedules. Top left example shows difference between NIC-based and host-based forwarding of flow from host $\text{H}_0$ to $\text{H}_2$.
  • Figure 2: Topology augmentation to model host-to-NIC bottleneck
  • Figure 3: Throughput of link-based all-to-all schedules on different topologies and runtimes. Appended / G indicates the schedule is lowered to GPUs and the MSCCL msccl runtime, whereas / C means the schedule is lowered to CPUs and the oneCCL oneccl runtime. Averaged over 20 iters.
  • Figure 4: Throughput of route-based all-to-all schedules on different topologies and runtimes. Appended / G indicates the schedule is lowered to GPUs and the NCCL msccl runtime, whereas / C means the schedule is lowered to CPUs and the Open MPI ompi runtime. Averaged over 20 iters.
  • Figure 5: Performance on punctured 3D Torus, removing 3 links (left) or 3 edges (right) at random. Envelope min/max/average (line) over 10 instances (20 iters per)
  • ...and 5 more figures

Theorems & Definitions (1)

  • Theorem 1: Lower bound on all-to-all time.