A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

Siddharth Singh; Prajwal Singhania; Aditya K. Ranjan; Zack Sating; Abhinav Bhatele

A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

Siddharth Singh, Prajwal Singhania, Aditya K. Ranjan, Zack Sating, Abhinav Bhatele

TL;DR

This work tackles the critical bottleneck of communication in large-scale parallel training by introducing a $4D$ hybrid parallel framework (AxoNN) that combines $3D$ tensor parallelism with data parallelism. It employs asynchronous overlap of collective operations and a topology-aware performance model to identify a small set of near-optimal configurations for a given workload, significantly reducing non-overlapped communication. Empirical results on GPT-scale transformers and U-Nets show substantial improvements over Megatron-LM, DeepSpeed-3D, and ZeRO-3 in both strong and weak scaling, including achieving up to 57% of peak FLOP/s on 1024 GPUs and reductions in communication overhead by orders of magnitude. The work provides both a practical framework and analytical tools to guide configuration, enabling scalable, efficient training on heterogeneous, large-scale GPU clusters.

Abstract

Heavy communication, in particular, collective operations, can become a critical performance bottleneck in scaling the training of billion-parameter neural networks to large-scale parallel systems. This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training. This 4D approach is a hybrid of 3D tensor and data parallelism, and is implemented in the AxoNN framework. In addition, we employ two key strategies to further minimize communication overheads. First, we aggressively overlap expensive collective operations (reduce-scatter, all-gather, and all-reduce) with computation. Second, we develop an analytical model to identify high-performing configurations within the large search space defined by our 4D algorithm. This model empowers practitioners by simplifying the tuning process for their specific training workloads. When training an 80-billion parameter GPT on 1024 GPUs of Perlmutter, AxoNN surpasses Megatron-LM, a state-of-the-art framework, by a significant 26%. Additionally, it achieves a significantly high 57% of the theoretical peak FLOP/s or 182 PFLOP/s in total.

A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

TL;DR

This work tackles the critical bottleneck of communication in large-scale parallel training by introducing a

hybrid parallel framework (AxoNN) that combines

tensor parallelism with data parallelism. It employs asynchronous overlap of collective operations and a topology-aware performance model to identify a small set of near-optimal configurations for a given workload, significantly reducing non-overlapped communication. Empirical results on GPT-scale transformers and U-Nets show substantial improvements over Megatron-LM, DeepSpeed-3D, and ZeRO-3 in both strong and weak scaling, including achieving up to 57% of peak FLOP/s on 1024 GPUs and reductions in communication overhead by orders of magnitude. The work provides both a practical framework and analytical tools to guide configuration, enabling scalable, efficient training on heterogeneous, large-scale GPU clusters.

Abstract

Paper Structure (27 sections, 4 equations, 12 figures, 3 tables, 1 algorithm)

This paper contains 27 sections, 4 equations, 12 figures, 3 tables, 1 algorithm.

Introduction
Related Work
Tensor Parallelism
Modeling Communication Performance
Designing a Hybrid 3D Tensor and Data Parallel Framework
Data Parallelism
Three-dimensional Tensor Parallelism
Performance Optimizations in AxoNN
Overlapping All-Reduces with Computation
Overlapping Reduce-Scatters with Computation
Overlapping All-Gathers with Computation
Caching Outputs of All-Gathers
A Performance Model of Collective Communication in Parallel Training
Placement-agnostic Performance Model
Placement-aware Performance Model
...and 12 more sections

Figures (12)

Figure 1: Computation in the forward pass of a fully-connected (FC) layer with input $I$ and layer weights $W$. The output, $O$ is a matrix multiplication of $I$ and $W$. We assume $I \in \mathbb{R}^{m \times k}$, $W \in \mathbb{R}^{k \times n}$, and $O \in \mathbb{R}^{m \times n}$.
Figure 2: Parallelization of an FC layer with Agarwal's 3D parallel matrix multiplication algorithm agarwal-3d on eight GPUs organized in a $2\times2\times2$ topology. We use $G_x$, $G_y$, and $G_z$ to refer to the number of GPUs along the three dimensions of the virtual grid topology.
Figure 3: Studying the effect of the proposed communication optimizations on the training times of a GPT 20B model on 16 GPUs of Perlmutter. We use Pipit bhatele:2023pipit for creating these breakdowns from trace data collected using the PyTorch Profiler pytorch-profiler.
Figure 4: PyTorch Profiler traces demonstrating i. (top) overlap of reduce scatters with backward pass compute as discussed in Section \ref{['sec:opt-depth-rs']}, ii. (middle) all-gathers without any overlap in the forward pass, and iii. (bottom) all-gathers after introducing the overlap optimization in Section \ref{['sec:opt-depth-ag']}. The first row in every trace corresponds to the compute stream and the others are communication streams.
Figure 5: Collective communication operations (all-reduce/reduce-scatter/all-gather) using the ring algorithm, spanning eight GPUs on two nodes.
...and 7 more figures

A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

TL;DR

Abstract

A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

Authors

TL;DR

Abstract

Table of Contents

Figures (12)