Table of Contents
Fetching ...

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning

William Won, Midhilesh Elavazhagan, Sudarshan Srinivasan, Swati Gupta, Tushar Krishna

TL;DR

This work tackles the challenge of efficiently coordinating communication in distributed ML across heterogeneous and large-scale network topologies by automating the synthesis of topology-aware collective algorithms. It introduces Time-expanded Networks (TEN) and a network-utilization–maximizing link-chunk matching approach to generate static, topology-aligned collectives without relying on NP-hard global optimizations. TACOS demonstrates strong performance gains over fixed baselines and prior synthesizers, achieving up to 4.27× improvements and scalable synthesis up to 40K NPUs in hours, while preserving near-ideal network utilization. The approach enables practical deployment across diverse ML systems and is backed by extensive simulations and end-to-end training results, with open-source artifacts to facilitate adoption. Overall, TACOS advances scalable, topology-conscious synchronization for distributed ML workloads and offers a tractable path toward automated optimization in future AI supercomputers.

Abstract

The surge of artificial intelligence, particularly large language models, has driven the rapid development of large-scale machine learning clusters. Executing distributed models on these clusters is often constrained by communication overhead, making efficient utilization of available network resources crucial. As a result, the routing algorithm employed for collective communications (i.e., collective algorithms) plays a pivotal role in determining overall performance. Unfortunately, existing collective communication libraries for distributed machine learning are limited by a fixed set of basic collective algorithms. This limitation hinders communication optimization, especially in modern clusters with heterogeneous and asymmetric topologies. Furthermore, manually designing collective algorithms for all possible combinations of network topologies and collective patterns requires heavy engineering and validation efforts. To address these challenges, this paper presents TACOS, an autonomous synthesizer capable of automatically generating topology-aware collective algorithms tailored to specific collective patterns and network topologies. TACOS is highly flexible, synthesizing an All-Reduce algorithm for a heterogeneous 128-NPU system in just 1.08 seconds, while achieving up to a 4.27x performance improvement over state-of-the-art synthesizers. Additionally, TACOS demonstrates better scalability with polynomial synthesis times, in contrast to NP-hard approaches which only scale to systems with tens of NPUs. TACOS can synthesize for 40K NPUs in just 2.52 hours.

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning

TL;DR

This work tackles the challenge of efficiently coordinating communication in distributed ML across heterogeneous and large-scale network topologies by automating the synthesis of topology-aware collective algorithms. It introduces Time-expanded Networks (TEN) and a network-utilization–maximizing link-chunk matching approach to generate static, topology-aligned collectives without relying on NP-hard global optimizations. TACOS demonstrates strong performance gains over fixed baselines and prior synthesizers, achieving up to 4.27× improvements and scalable synthesis up to 40K NPUs in hours, while preserving near-ideal network utilization. The approach enables practical deployment across diverse ML systems and is backed by extensive simulations and end-to-end training results, with open-source artifacts to facilitate adoption. Overall, TACOS advances scalable, topology-conscious synchronization for distributed ML workloads and offers a tractable path toward automated optimization in future AI supercomputers.

Abstract

The surge of artificial intelligence, particularly large language models, has driven the rapid development of large-scale machine learning clusters. Executing distributed models on these clusters is often constrained by communication overhead, making efficient utilization of available network resources crucial. As a result, the routing algorithm employed for collective communications (i.e., collective algorithms) plays a pivotal role in determining overall performance. Unfortunately, existing collective communication libraries for distributed machine learning are limited by a fixed set of basic collective algorithms. This limitation hinders communication optimization, especially in modern clusters with heterogeneous and asymmetric topologies. Furthermore, manually designing collective algorithms for all possible combinations of network topologies and collective patterns requires heavy engineering and validation efforts. To address these challenges, this paper presents TACOS, an autonomous synthesizer capable of automatically generating topology-aware collective algorithms tailored to specific collective patterns and network topologies. TACOS is highly flexible, synthesizing an All-Reduce algorithm for a heterogeneous 128-NPU system in just 1.08 seconds, while achieving up to a 4.27x performance improvement over state-of-the-art synthesizers. Additionally, TACOS demonstrates better scalability with polynomial synthesis times, in contrast to NP-hard approaches which only scale to systems with tens of NPUs. TACOS can synthesize for 40K NPUs in just 2.52 hours.
Paper Structure (38 sections, 23 figures, 5 tables, 2 algorithms)

This paper contains 38 sections, 23 figures, 5 tables, 2 algorithms.

Figures (23)

  • Figure 1: Heat map of total message size transferred over each link, when running 1 GB All-Reduce using basic algorithms (Direct, RHD, Ring, and Tacos for comparison) over different network topologies (FullyConnected (FC), Ring, 2D Mesh, and 3D Hypercube (HC)). Each cell at (src, dest) denotes a link connecting NPU src to NPU dest. If there is no such link in the topology, the cell is marked black. All values are normalized to the largest value per topology. For every topology, scenarios running topology-aware collective algorithms are marked with a red box, which results in balanced, lowest overall loads (i.e., a cooler heat map).
  • Figure 2: (a) All-Reduce bandwidth (i.e., collective size $\div$ collective time) over distinct network topologies with 64 NPUs (using link $\alpha$=0.5µs, 1/$\beta$=50GB/s, as explained in \ref{['subsec:tacos_extension']}), measured with 1 GB collective. For 2D Mesh and 3D Hypercube, we also run the topology-aware algorithm synthesized by Tacos (T). (b) All-Reduce bandwidth for different collective algorithms on a 128-NPU physical Ring (link $\alpha$=30ns, 1/$\beta$=150GB/s), measured with varying collective sizes. Results are normalized per each graph by the smallest number.
  • Figure 3: (a) Current CCLs execute collectives by selecting an algorithm from a set of predefined implementations. (b) High-level overview of the Tacos framework. The target network topology and collective pattern are provided as inputs. Tacos expands the network into a TEN and evaluates the collective precondition (i.e., which chunks are currently held by each NPU) for time $t$=0. Based on this information, Tacos employs a chunk-link matching algorithm that maximizes network resource utilization for the target time $t$=0. Tacos iteratively runs this matching process for successive time spans until the postcondition is satisfied (i.e., all NPUs have received every desired chunk). The details are explained in \ref{['sec:greedyBased']}. This procedure yields a topology-aware collective algorithm (i.e., static path of each chunk), which can then be utilized by CCLs in lieu of the predefined topology-unaware basic algorithms.
  • Figure 4: Common collective patterns used in distributed ML. Each collective communication defines a specific data transfer pattern. For example, in All-Gather, each NPU starts with one chunk (denoted as a circle) and broadcasts it to all other NPUs. In Reduce-Scatter, all NPUs start with $N$ chunks and end up with one chunk, constructed by summing the corresponding chunks from all other NPUs.
  • Figure 5: Example All-Reduce algorithms and their traffic patterns. In Reduce-Scatter, each red arrow represents sending and adding a chunk to the local data. For All-Gather, each blue arrow signifies forwarding a chunk. Depending on the network connectivity, each step may encounter link congestion and underutilization.
  • ...and 18 more figures