TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
William Won, Midhilesh Elavazhagan, Sudarshan Srinivasan, Swati Gupta, Tushar Krishna
TL;DR
This work addresses the problem of coordinating communication efficiently in distributed ML on heterogeneous, large-scale network topologies by automating the synthesis of topology-aware collective algorithms. It introduces Time-expanded Networks (TEN) and a link-chunk matching approach that maximizes network utilization, generating static, topology-aligned collective algorithms without relying on NP-hard global optimization. TACOS outperforms fixed baseline algorithms and prior synthesizers, with up to a 4.27× performance improvement, and its polynomial synthesis time scales to 40K NPUs in 2.52 hours while preserving near-ideal network utilization. The results are backed by extensive simulations and end-to-end training experiments, and open-source artifacts are available to facilitate adoption. Overall, TACOS offers a tractable path toward automated, topology-aware collective optimization for distributed ML workloads and future AI supercomputers.
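To make the idea of link-chunk matching on a time-expanded network more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes an All-Gather collective, a directed topology in which every link has the same latency and carries one chunk per time step, and a simple deterministic greedy choice of chunk. The function name `synthesize_all_gather` and the schedule format are hypothetical. Heterogeneous link costs, which TACOS targets, are not modeled here.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]           # directed link (src, dst)
Transfer = Tuple[int, int, int]  # (src, dst, chunk)


def synthesize_all_gather(num_nodes: int, links: List[Link]) -> List[List[Transfer]]:
    """Greedy link-chunk matching, one time-expanded layer per step.

    Assumption (not from the paper): all links have unit latency and
    carry exactly one chunk per step. Node i initially holds chunk i.
    Returns schedule[t] = list of (src, dst, chunk) transfers in step t.
    """
    holds: List[Set[int]] = [{i} for i in range(num_nodes)]
    schedule: List[List[Transfer]] = []

    while any(len(h) < num_nodes for h in holds):
        # Chunks already matched to arrive at each node during this step.
        arriving: List[Set[int]] = [set() for _ in range(num_nodes)]
        step: List[Transfer] = []

        for src, dst in links:
            # Match this link to a chunk that src held at the start of the
            # step and that dst neither holds nor is already receiving.
            candidates = holds[src] - holds[dst] - arriving[dst]
            if candidates:
                chunk = min(candidates)  # deterministic pick for the sketch
                arriving[dst].add(chunk)
                step.append((src, dst, chunk))

        if not step:
            # No link can deliver anything new, so some chunk is unreachable.
            raise ValueError("All-Gather cannot complete on this topology")

        # Advance to the next layer of the time-expanded network.
        for node in range(num_nodes):
            holds[node] |= arriving[node]
        schedule.append(step)

    return schedule


if __name__ == "__main__":
    ring = [(i, (i + 1) % 4) for i in range(4)]
    for t, step in enumerate(synthesize_all_gather(4, ring)):
        print(f"step {t}: {step}")
```

On a 4-node unidirectional ring this sketch produces a 3-step schedule in which every link is busy in every step, which matches the familiar ring All-Gather. Adding the reverse links shortens the schedule to 2 steps, showing how the same matching procedure adapts to the topology it is given.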
Abstract
The surge of artificial intelligence, particularly large language models, has driven the rapid development of large-scale machine learning clusters. Executing distributed models on these clusters is often constrained by communication overhead, making efficient utilization of available network resources crucial. As a result, the routing algorithm employed for collective communications (i.e., collective algorithms) plays a pivotal role in determining overall performance. Unfortunately, existing collective communication libraries for distributed machine learning are limited by a fixed set of basic collective algorithms. This limitation hinders communication optimization, especially in modern clusters with heterogeneous and asymmetric topologies. Furthermore, manually designing collective algorithms for all possible combinations of network topologies and collective patterns requires heavy engineering and validation efforts. To address these challenges, this paper presents TACOS, an autonomous synthesizer capable of automatically generating topology-aware collective algorithms tailored to specific collective patterns and network topologies. TACOS is highly flexible, synthesizing an All-Reduce algorithm for a heterogeneous 128-NPU system in just 1.08 seconds, while achieving up to a 4.27x performance improvement over state-of-the-art synthesizers. Additionally, TACOS demonstrates better scalability with polynomial synthesis times, in contrast to NP-hard approaches which only scale to systems with tens of NPUs. TACOS can synthesize for 40K NPUs in just 2.52 hours.
