The Big Send-off: High Performance Collectives on GPU-based Supercomputers
Siddharth Singh, Mahua Singh, Abhinav Bhatele
TL;DR
The paper addresses the bottleneck of collective communication in large-scale DL training on GPU supercomputers, focusing on all-gather and reduce-scatter. It introduces PCCL, a hierarchical, GPU-optimized library that combines inter-node recursive-halving/doubling with intra-node RCCL and a device-local shuffle to balance NIC traffic and reduce latency. The library is evaluated on Frontier and Perlmutter. On 2048 GCDs of Frontier, PCCL's all-gather is 6–33x faster than RCCL and 28–70x faster than Cray-MPICH. These gains carry over to end-to-end training of GPT-3-style models on Frontier, with speedups over RCCL of up to 60% for a 7B-parameter model and 40% for a 13B-parameter model. The work demonstrates that algorithmic diversification, careful resource balancing, and GPU-accelerated reductions can dramatically improve scalable DL training, enabling more efficient use of next-generation GPU systems.
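To make the recursive-doubling and recursive-halving pattern mentioned above concrete, the following is a minimal single-process sketch of the textbook algorithms, not PCCL's implementation. It assumes a power-of-two number of ranks, models each rank as a plain Python list, and ignores the intra-node stage and the device-local shuffle entirely. The function names are hypothetical and chosen only for illustration.

```python
def recursive_doubling_all_gather(chunks):
    """Simulate all-gather: rank r starts with chunks[r], ends with all chunks.

    Assumes len(chunks) is a power of two (an assumption of this sketch).
    Returns (per-rank gathered lists, number of exchange steps).
    """
    p = len(chunks)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes a power-of-two rank count"
    # buffers[r] maps original owner -> chunk currently held by rank r
    buffers = [{r: chunks[r]} for r in range(p)]
    steps = 0
    distance = 1
    while distance < p:
        # Snapshot so every rank "receives" what its partner held before this step
        snapshot = [dict(b) for b in buffers]
        for rank in range(p):
            partner = rank ^ distance  # partner differs in exactly one bit
            buffers[rank].update(snapshot[partner])
        distance *= 2  # exchanged data doubles each step
        steps += 1
    gathered = [[buf[i] for i in range(p)] for buf in buffers]
    return gathered, steps


def recursive_halving_reduce_scatter(inputs):
    """Simulate sum reduce-scatter: rank r ends with the sum of block r.

    inputs[r] is rank r's vector of p blocks (one number per block here).
    Assumes len(inputs) is a power of two (an assumption of this sketch).
    """
    p = len(inputs)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes a power-of-two rank count"
    data = [list(v) for v in inputs]
    lo = [0] * p   # each rank is responsible for blocks [lo, hi)
    hi = [p] * p
    distance = p // 2
    while distance >= 1:
        snapshot = [list(v) for v in data]
        for rank in range(p):
            partner = rank ^ distance
            mid = (lo[rank] + hi[rank]) // 2
            # Keep the half of the range that contains this rank's own block
            if rank & distance:
                lo[rank] = mid
            else:
                hi[rank] = mid
            # Add the partner's partial sums for the half that is kept
            for i in range(lo[rank], hi[rank]):
                data[rank][i] = snapshot[rank][i] + snapshot[partner][i]
        distance //= 2  # exchanged data halves each step
    return [data[r][r] for r in range(p)]
```

In both routines the number of exchange steps is log2(p), which is why such algorithms are attractive when per-message latency dominates. A real implementation would replace the snapshot copies with pairwise sends and receives between nodes.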
Abstract
We evaluate the current state of collective communication on GPU-based supercomputers for large language model (LLM) training at scale. Existing libraries such as RCCL and Cray-MPICH exhibit critical limitations on systems such as Frontier -- Cray-MPICH underutilizes network and compute resources, while RCCL suffers from severe scalability issues. To address these challenges, we introduce PCCL, a communication library with highly optimized implementations of all-gather and reduce-scatter operations tailored for distributed deep learning workloads. PCCL is designed to maximally utilize all available network and compute resources and to scale efficiently to thousands of GPUs. It achieves substantial performance improvements, delivering 6-33x speedups over RCCL and 28-70x over Cray-MPICH for all-gather on 2048 GCDs of Frontier. These gains translate directly to end-to-end performance: in large-scale GPT-3-style training, PCCL provides up to 60% and 40% speedups over RCCL for 7B and 13B parameter models, respectively.
