The Big Send-off: High Performance Collectives on GPU-based Supercomputers

Siddharth Singh, Mahua Singh, Abhinav Bhatele

TL;DR

The paper addresses the collective-communication bottleneck in large-scale deep learning (DL) training on GPU supercomputers, focusing on all-gather and reduce-scatter. It introduces PCCL, a hierarchical, GPU-optimized library that combines inter-node recursive halving/doubling with intra-node RCCL and a device-local shuffle to balance NIC traffic and reduce latency. Results are reported on Frontier and Perlmutter; on 2048 GCDs of Frontier, PCCL's all-gather is 6–33x faster than RCCL and 28–70x faster than Cray-MPICH. These gains carry over to end-to-end training of GPT-3-style models, with speedups over RCCL of up to 60% on Frontier for a 7B model and 40% for a 13B model. The work shows that algorithmic diversification, careful resource balancing, and GPU-accelerated reductions can substantially improve scalable DL training and make more efficient use of next-generation GPU systems.
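
To make the "recursive doubling" idea concrete, the following is a minimal single-process sketch of a textbook recursive-doubling all-gather. It is not PCCL code: the function name `recursive_doubling_allgather` is hypothetical, ranks are simulated as list indices, and the sketch assumes a power-of-two rank count. It shows only why the pattern finishes in log2(p) exchange rounds.

```python
def recursive_doubling_allgather(chunks):
    """Simulate an all-gather over p = len(chunks) ranks (p a power of two).

    chunks[r] is the single chunk initially owned by rank r.
    Returns (buffers, rounds), where buffers[r] is rank r's final buffer.
    Hypothetical helper for illustration; not part of PCCL, RCCL, or MPI.
    """
    p = len(chunks)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")

    # Each rank tracks the chunks it currently holds, keyed by owner rank.
    held = [{r: chunks[r]} for r in range(p)]
    rounds = 0
    distance = 1
    while distance < p:
        # Snapshot so that every exchange in a round uses pre-round state.
        snapshot = [dict(h) for h in held]
        for r in range(p):
            partner = r ^ distance  # partner differs in exactly one bit
            held[r].update(snapshot[partner])
        distance *= 2  # the amount of data exchanged doubles each round
        rounds += 1

    # Lay the chunks out in rank order, as an all-gather output buffer would.
    buffers = [[held[r][owner] for owner in range(p)] for r in range(p)]
    return buffers, rounds


if __name__ == "__main__":
    buffers, rounds = recursive_doubling_allgather(list("abcdefgh"))
    print(rounds, buffers[0])
```

With 8 simulated ranks the loop runs for three rounds, compared with the seven steps a ring-style all-gather over the same ranks would need. Recursive halving for reduce-scatter follows the same pairing pattern in reverse order, with a reduction applied at each exchange.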

Abstract

We evaluate the current state of collective communication on GPU-based supercomputers for large language model (LLM) training at scale. Existing libraries such as RCCL and Cray-MPICH exhibit critical limitations on systems such as Frontier -- Cray-MPICH underutilizes network and compute resources, while RCCL suffers from severe scalability issues. To address these challenges, we introduce PCCL, a communication library with highly optimized implementations of all-gather and reduce-scatter operations tailored for distributed deep learning workloads. PCCL is designed to maximally utilize all available network and compute resources and to scale efficiently to thousands of GPUs. It achieves substantial performance improvements, delivering 6-33x speedups over RCCL and 28-70x over Cray-MPICH for all-gather on 2048 GCDs of Frontier. These gains translate directly to end-to-end performance: in large-scale GPT-3-style training, PCCL provides up to 60% and 40% speedups over RCCL for 7B and 13B parameter models, respectively.

Paper Structure

This paper contains 23 sections, 2 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Performance comparison of all-gather using Cray-MPICH vs. RCCL on Frontier for two output buffer sizes of 64 and 128 MB. The ideal scaling behavior (flat horizontal line) is not achieved by either library, highlighting their limited scalability at increasing GCD counts.
  • Figure 2: Distribution of all-gather and reduce-scatter message sizes for several deep learning frameworks across a range of transformer model sizes. The y-axis represents input buffer sizes for all-gathers but output buffer sizes for reduce-scatters.
  • Figure 3: The left plot compares all-gather performance of Cray MPICH and RCCL on Frontier for a bandwidth-bound scenario with large message sizes (256 and 512 MB) and small GPU counts. The middle and right plots show the number of packets read from (middle) and written to (right) each of the four NICs on a Frontier compute node during all-gather operations.
  • Figure 4: Performance comparison of reduce-scatter using Cray MPICH, RCCL, and a custom implementation of reduce-scatter that uses Cray MPICH P2P and GPU compute kernels.
  • Figure 5: Diagram showing our hierarchical (two-level) implementation, which decomposes an all-gather operation on a GPU-based cluster with N nodes and M GPUs per node. In step 1, we perform inter-node all-gathers; in step 2, we perform intra-node all-gathers; and in step 3, each GPU performs a local shuffle of the received data.
  • ...and 7 more figures
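
The three steps in the Figure 5 caption can be sketched as a small single-process simulation. This is an illustration under stated assumptions, not the paper's implementation: the function name `hierarchical_allgather` is hypothetical, global ranks are assumed to be numbered `node * M + local`, and step 1 is assumed to run among GPUs that share the same local index on different nodes. Under those assumptions, the data after step 2 is ordered by (local index, node), so step 3 amounts to a transpose into rank order.

```python
def hierarchical_allgather(inputs, num_nodes, gpus_per_node):
    """Simulate a two-level all-gather on N nodes with M GPUs per node.

    inputs[rank] is the chunk owned by global rank = node * M + local
    (an assumed rank mapping). Returns the per-rank output buffers.
    Hypothetical helper for illustration; not part of PCCL.
    """
    N, M = num_nodes, gpus_per_node
    if len(inputs) != N * M:
        raise ValueError("expected one input chunk per GPU")

    # Step 1: inter-node all-gather among GPUs with the same local index.
    # GPU (node, local) ends up with N chunks, ordered by source node.
    step1 = {
        (node, local): [inputs[peer * M + local] for peer in range(N)]
        for node in range(N)
        for local in range(M)
    }

    # Step 2: intra-node all-gather among the M GPUs of each node.
    # Every GPU now holds all N * M chunks, ordered by (local, node).
    step2 = {}
    for node in range(N):
        gathered = []
        for peer_local in range(M):
            gathered.extend(step1[(node, peer_local)])
        for local in range(M):
            step2[(node, local)] = list(gathered)

    # Step 3: local shuffle, a transpose from (local, node) order to
    # (node, local) order, which matches the global rank order.
    outputs = []
    for node in range(N):
        for local in range(M):
            buf = step2[(node, local)]
            outputs.append([buf[l * N + n] for n in range(N) for l in range(M)])
    return outputs


if __name__ == "__main__":
    result = hierarchical_allgather(list(range(8)), num_nodes=2, gpus_per_node=4)
    print(result[0])
```

In this sketch, each step-1 group contains one GPU per node, so all M groups run inter-node transfers at the same time. That is one plausible way for inter-node traffic to be spread across a node's NICs, though the actual GPU-to-NIC mapping is not given in the text above.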