Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies
Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury
TL;DR
The paper addresses bandwidth-optimal all-to-all scheduling on large-scale direct-connect networks used in ML and HPC. It formulates all-to-all as a Max Concurrent Multi-Commodity Flow (MCF) problem and scales the solution by introducing a master LP with parallel child LPs, plus time-stepped and path-based MCF variants to handle different fabrics. A full compiler/toolchain lowers the optimized flows to practical runtimes (MSCCL/oneCCL) and NIC routing. The resulting schedules achieve near-optimal throughput across diverse topologies and scales, including GenKautz expanders, which come close to the theoretical lower bound on all-to-all time. The results show substantial performance gains over baselines, runtime efficiency at large N (1000 and beyond), and guidance for topology choices and scheduling in reconfigurable direct-connect interconnects. This work enables scalable, high-bandwidth all-to-all communication for ML embeddings, FFTs, and MoE workloads on modern accelerator clusters.
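To make the "theoretical lower bound on all-to-all time" concrete, here is a minimal sketch of the standard distance-based bandwidth bound, not the paper's own code. It assumes every node sends one shard of `shard_bytes` to every other node, every directed link has the same bandwidth `link_bw`, and latency is ignored. Each shard must cross at least as many links as the shortest-path distance between its endpoints, so the total hop count divided by the aggregate link capacity bounds the completion time from below. The function names and the two example topologies are illustrative assumptions.

```python
from collections import deque


def bfs_distances(adj, src):
    """Shortest hop counts from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def all_to_all_lower_bound(adj, shard_bytes, link_bw):
    """Distance-based lower bound on all-to-all completion time.

    adj maps each node to its list of out-neighbors (directed links).
    Assumes uniform link bandwidth and no latency term.
    """
    n = len(adj)
    num_links = sum(len(nbrs) for nbrs in adj.values())
    total_hops = 0
    for s in adj:
        dist = bfs_distances(adj, s)
        if len(dist) != n:
            raise ValueError("topology is not strongly connected")
        total_hops += sum(dist.values())  # distance to self is 0
    return shard_bytes * total_hops / (num_links * link_bw)


def bidirectional_ring(n):
    """Hypothetical example topology: ring with links in both directions."""
    return {i: [(i + 1) % n, (i - 1) % n] for i in range(n)}


def complete_graph(n):
    """Hypothetical example topology: every node linked to every other."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


if __name__ == "__main__":
    for name, topo in [("ring", bidirectional_ring(8)),
                       ("complete", complete_graph(8))]:
        bound = all_to_all_lower_bound(topo, shard_bytes=1.0, link_bw=1.0)
        print(f"{name}: lower bound = {bound}")
```

A schedule is bandwidth-optimal in this sense when its completion time matches the bound. Low-diameter, low-degree topologies such as expanders keep the total hop count small without needing the node degree of a complete graph.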
Abstract
The all-to-all collective communication primitive is widely used in machine learning (ML) and high-performance computing (HPC) workloads, and optimizing its performance is of interest to both the ML and HPC communities. All-to-all is a particularly challenging workload that can severely strain the underlying interconnect bandwidth at scale. This paper takes a holistic approach to optimizing the performance of all-to-all collective communication on supercomputer-scale direct-connect interconnects. We address several algorithmic and practical challenges in developing efficient, bandwidth-optimal all-to-all schedules for any topology and in lowering those schedules to various runtimes and interconnect technologies. We also propose a novel topology that delivers near-optimal all-to-all performance.
