A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs
Siddharth Singh, Prajwal Singhania, Aditya K. Ranjan, Zack Sating, Abhinav Bhatele
TL;DR
This work tackles the critical bottleneck of communication in large-scale parallel training by introducing a $4D$ hybrid parallel framework (AxoNN) that combines $3D$ tensor parallelism with data parallelism. It employs asynchronous overlap of collective operations and a topology-aware performance model to identify a small set of near-optimal configurations for a given workload, significantly reducing non-overlapped communication. Empirical results on GPT-scale transformers and U-Nets show substantial improvements over Megatron-LM, DeepSpeed-3D, and ZeRO-3 in both strong and weak scaling, including achieving up to 57% of peak FLOP/s on 1024 GPUs and reductions in communication overhead by orders of magnitude. The work provides both a practical framework and analytical tools to guide configuration, enabling scalable, efficient training on heterogeneous, large-scale GPU clusters.
Abstract
Heavy communication, in particular, collective operations, can become a critical performance bottleneck in scaling the training of billion-parameter neural networks to large-scale parallel systems. This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training. This 4D approach is a hybrid of 3D tensor and data parallelism, and is implemented in the AxoNN framework. In addition, we employ two key strategies to further minimize communication overheads. First, we aggressively overlap expensive collective operations (reduce-scatter, all-gather, and all-reduce) with computation. Second, we develop an analytical model to identify high-performing configurations within the large search space defined by our 4D algorithm. This model empowers practitioners by simplifying the tuning process for their specific training workloads. When training an 80-billion parameter GPT on 1024 GPUs of Perlmutter, AxoNN surpasses Megatron-LM, a state-of-the-art framework, by a significant 26%. Additionally, it achieves a significantly high 57% of the theoretical peak FLOP/s or 182 PFLOP/s in total.
