Optimizing Allreduce Operations for Modern Heterogeneous Architectures with Multiple Processes per GPU

Michael Adams, Amanda Bienz

TL;DR

Novel optimizations to large GPU-aware all-reduce operations are presented by extending the lane-aware algorithm to heterogeneous architectures and notably using multiple CPU cores per GPU to accelerate these operations.

Abstract

Large inter-GPU all-reduce operations, prevalent throughout deep learning, are bottlenecked by communication costs. Emerging heterogeneous architectures consist of complex nodes, each often containing $4$ GPUs and dozens to hundreds of CPU cores. Parallel applications are typically accelerated on the available GPUs, using only a single CPU core per GPU while the remaining cores sit idle. This paper presents novel optimizations to large GPU-aware all-reduce operations by extending the lane-aware algorithm to heterogeneous architectures and, notably, by using multiple CPU cores per GPU to accelerate these operations. Using GPUDirect RDMA and host-copy communication, respectively, these multi-CPU-accelerated GPU-aware all-reduces yield speedups over system MPI of up to $3$x on LLNL's Tuolumne supercomputer and up to $2.45$x for large MPI all-reduces across the NVIDIA A100 GPUs of NCSA's Delta supercomputer.
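For readers unfamiliar with the lane-aware approach named in the abstract, the following is a minimal sketch of its general three-phase structure as commonly described: a reduce-scatter within each node, an all-reduce across nodes along each "lane" (the processes sharing a local rank), and an all-gather within each node. It is a sequential pure-Python simulation for illustration only, not the paper's implementation; the function name `lane_allreduce`, the nested-list data layout, and the use of summation as the reduction are all assumptions made here, and no MPI, GPU, or multi-process-per-GPU behavior is modeled.

```python
from typing import List


def lane_allreduce(data: List[List[List[int]]]) -> List[List[List[int]]]:
    """Simulate a lane-style sum all-reduce (hypothetical helper).

    data[node][local_rank] is the input buffer held by one process.
    Returns the buffers every process holds after the all-reduce.
    Assumes every buffer has the same length, divisible by the
    number of processes per node.
    """
    num_nodes = len(data)
    ppn = len(data[0])          # processes (e.g. GPUs) per node
    n = len(data[0][0])         # buffer length
    assert n % ppn == 0, "buffer length must divide evenly among local ranks"
    chunk = n // ppn

    # Phase 1: intra-node reduce-scatter.
    # Local rank r ends up with chunk r, summed over all processes on its node.
    partial = [
        [
            [sum(data[node][p][r * chunk + i] for p in range(ppn))
             for i in range(chunk)]
            for r in range(ppn)
        ]
        for node in range(num_nodes)
    ]

    # Phase 2: inter-node all-reduce along each lane.
    # Lane r consists of local rank r on every node; each lane reduces
    # only its own chunk, so every lane moves 1/ppn of the data.
    lane_sum = [
        [sum(partial[node][r][i] for node in range(num_nodes))
         for i in range(chunk)]
        for r in range(ppn)
    ]

    # Phase 3: intra-node all-gather.
    # Every process collects the fully reduced chunks from its node-local peers.
    full = [x for r in range(ppn) for x in lane_sum[r]]
    return [[list(full) for _ in range(ppn)] for _ in range(num_nodes)]
```

In this sketch, a system with 2 nodes and 4 processes per node would call `lane_allreduce(data)` with `data` shaped `[2][4][n]`; only the middle phase crosses node boundaries, and each lane carries a quarter of the buffer.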

Paper Structure

This paper contains 15 sections, 18 figures.

Figures (18)

  • Figure 1: Example All-Reduce Setup
  • Figure 2: Ring All-Reduce, First Step
  • Figure 3: Lane All-Reduce, Inter-Node Step
  • Figure 4: Standard MPI_Allreduce vs our optimized all-reduce on LLNL's Tuolumne
  • Figure 5: A ping-pong benchmark using multiple active processes per (logical) GPU featuring performance on Delta (left), and the SPX (middle) and CPX (right) modes of the AMD MI300A on Tuolumne
  • ...and 13 more figures