ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He

TL;DR

ZeRO++ addresses the critical bottleneck of inter-GPU communication in giant-model training by introducing three coordinated techniques: qwZ (block-based INT8 weight quantization for forward all-gather), hpZ (on-node secondary weight partitions to remove inter-node backward communication), and qgZ (an all-to-all, INT4-based gradient reduction with hierarchical two-hop communication and tensor reordering). Together, these reduce cross-node communication from 3M to 0.75M per iteration, delivering up to 2.16x–2.4x end-to-end throughput gains on up to 384 GPUs, while preserving model convergence. The approach combines custom high-performance CUDA kernels, overlapping compute with communication, and pipeline-friendly gradient coordination to achieve near-linear scalability across bandwidths. ZeRO++ is released with DeepSpeed, enabling practical, scalable training of trillion-parameter-scale models on heterogeneous clusters.
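The 3M to 0.75M figure can be checked with a small accounting sketch. The per-collective breakdown below is an assumption: it is one reading consistent with the totals quoted above (INT8 halves an FP16 all-gather, on-node secondary partitions remove the cross-node backward all-gather, INT4 quarters an FP16 gradient reduction). The names `BASELINE`, `SCALE`, and `cross_node_volume` are hypothetical.

```python
from fractions import Fraction

# Cross-node traffic per iteration, in units of M, for each ZeRO-3 collective.
# Assumption: the baseline communicates FP16 data, i.e. M per collective.
BASELINE = {
    "forward all-gather": Fraction(1),
    "backward all-gather": Fraction(1),
    "gradient reduce-scatter": Fraction(1),
}

# Assumed scaling factor applied by each ZeRO++ technique.
SCALE = {
    "forward all-gather": Fraction(1, 2),       # qwZ: FP16 -> INT8
    "backward all-gather": Fraction(0),         # hpZ: served from on-node copies
    "gradient reduce-scatter": Fraction(1, 4),  # qgZ: FP16 -> INT4
}


def cross_node_volume(baseline, scale):
    """Sum the scaled per-collective volumes (result is in units of M)."""
    return sum(baseline[name] * scale[name] for name in baseline)


baseline_total = sum(BASELINE.values())
zeropp_total = cross_node_volume(BASELINE, SCALE)
reduction = baseline_total / zeropp_total

print(f"ZeRO-3: {float(baseline_total)}M")
print(f"ZeRO++: {float(zeropp_total)}M")
print(f"reduction: {float(reduction)}x")
```

Under these assumptions the totals come out to 3M and 0.75M, which matches the 4x reduction stated in the abstract.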

Abstract

Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large language models on massive GPU clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, or at a scale that forces the batch size per GPU to be small, ZeRO's effective throughput is limited by the high communication volume of gathering weights in the forward pass and the backward pass, and of averaging gradients. This paper introduces three communication volume reduction techniques, which we collectively refer to as ZeRO++, targeting each of the communication collectives in ZeRO. The first is a block-quantization-based all-gather. The second is a data remapping that trades off communication for more memory. The third is a novel all-to-all-based quantized gradient averaging paradigm that replaces the reduce-scatter collective and preserves accuracy despite communicating low-precision data. Collectively, ZeRO++ reduces the communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at 384-GPU scale.
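To make the first technique concrete, here is a minimal pure-Python sketch of block-based quantization compared with a single scale for the whole tensor. It is an illustration under stated assumptions, not the paper's CUDA kernels: the symmetric INT8 scheme, the block size of 64, the synthetic weights, and the helper names (`quantize_blocks`, `dequantize_blocks`, `mean_abs_error`) are all hypothetical choices.

```python
import random


def quantize_blocks(values, block_size, bits=8):
    """Symmetric quantization where each block carries its own scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in block)
    return codes, scales


def dequantize_blocks(codes, scales, block_size):
    """Map integer codes back to floats using the scale of their block."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
# Hypothetical weights: mostly small values plus one large outlier.
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
weights[5] = 4.0

# Baseline: one scale for the whole tensor (a single block).
codes, scales = quantize_blocks(weights, block_size=len(weights))
err_tensor = mean_abs_error(weights, dequantize_blocks(codes, scales, len(weights)))

# Block-based: an independent scale per 64-element block.
codes, scales = quantize_blocks(weights, block_size=64)
err_block = mean_abs_error(weights, dequantize_blocks(codes, scales, 64))

print(f"whole-tensor scale error: {err_tensor:.6f}")
print(f"per-block scale error:    {err_block:.6f}")
```

With a single scale, the outlier stretches the quantization grid for every element. With per-block scales, only the block containing the outlier pays that cost, which is the intuition behind quantizing in blocks before the all-gather.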

Paper Structure

This paper contains 35 sections, 3 equations, 14 figures, 5 tables, 2 algorithms.

Figures (14)

  • Figure 1: Large-scale training throughput is constrained by network bandwidth and batch size per GPU.
  • Figure 2: Illustration and example of block-based quantization vs. baseline.
  • Figure 3: hpZ removes cross-node traffic in the backward all-gather by holding secondary weight partitions in on-device memory.
  • Figure 4: Per-device memory consumption analysis of standard data parallelism (DP), ZeRO stage 3 (ZeRO-3), and the proposed hierarchical partitioning of ZeRO parameters ($hpZ$). $K$ denotes the memory multiplier of optimizer states, $M$ represents the number of trainable parameters, $P$ is the data parallel group size or world size, and $\alpha$ is the number of secondary groups, or the ratio of world size to the number of ranks in the secondary group. A typical real-world example is provided in the last column. We assume a model size of 100B trained on a DGX cluster of 1024 V100 GPUs (64 compute nodes, 16 GPUs per node).
  • Figure 5: Comparison between ZeRO-3 ring-based reduce-scatter and qgZ 1-hop all-to-all.
  • ...and 9 more figures
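
The contrast named in the Figure 5 caption can be simulated in a few lines. The sketch below is a toy model under stated assumptions, not the qgZ implementation: it uses symmetric INT4 with one scale per chunk, models the ring as a simple left-to-right chain of ranks, and leaves out the hierarchical two-hop communication and tensor reordering. All function names are hypothetical.

```python
import random

QMAX = 7  # symmetric INT4 range [-7, 7] (an assumed quantization scheme)


def quant_dequant(chunk):
    """Quantize a chunk to INT4 with one scale, then dequantize it."""
    max_abs = max(abs(v) for v in chunk)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    return [round(v / scale) * scale for v in chunk]


def all_to_all_average(grads, chunk_len):
    """Rank j receives chunk j from every rank; each value is quantized once."""
    world = len(grads)
    result = []
    for j in range(world):
        lo, hi = j * chunk_len, (j + 1) * chunk_len
        received = [quant_dequant(g[lo:hi]) for g in grads]
        # The reduction itself runs in full precision after dequantization.
        result.append([sum(col) / world for col in zip(*received)])
    return result


def ring_average(grads, chunk_len):
    """Quantized ring: the running partial sum is re-quantized at every hop."""
    world = len(grads)
    result = []
    for j in range(world):
        lo, hi = j * chunk_len, (j + 1) * chunk_len
        partial = grads[0][lo:hi]
        for rank in range(1, world):
            sent = quant_dequant(partial)  # lossy send to the next rank
            partial = [a + b for a, b in zip(sent, grads[rank][lo:hi])]
        result.append([v / world for v in partial])
    return result


def exact_average(grads, chunk_len):
    world = len(grads)
    return [[sum(g[i] for g in grads) / world
             for i in range(j * chunk_len, (j + 1) * chunk_len)]
            for j in range(world)]


def mean_abs_error(a, b):
    flat_a = [v for chunk in a for v in chunk]
    flat_b = [v for chunk in b for v in chunk]
    return sum(abs(x - y) for x, y in zip(flat_a, flat_b)) / len(flat_a)


random.seed(0)
WORLD, CHUNK = 16, 32  # hypothetical sizes
grads = [[random.gauss(0.0, 1.0) for _ in range(WORLD * CHUNK)] for _ in range(WORLD)]

reference = exact_average(grads, CHUNK)
err_a2a = mean_abs_error(all_to_all_average(grads, CHUNK), reference)
err_ring = mean_abs_error(ring_average(grads, CHUNK), reference)
print(f"all-to-all error: {err_a2a:.4f}")
print(f"quantized ring error: {err_ring:.4f}")
```

In this toy setting the ring re-quantizes a growing partial sum at each hop, so its rounding errors accumulate. The all-to-all variant quantizes each gradient value once and reduces in full precision, which is one way to read the abstract's claim that accuracy is preserved despite low-precision communication.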