ZeroPP: Unleashing Exceptional Parallelism Efficiency through Tensor-Parallelism-Free Methodology

Ding Tang, Lijuan Jiang, Jiecheng Zhou, Minxi Jin, Hengjie Li, Xingcheng Zhang, Zhilin Pei, Jidong Zhai

TL;DR

The paper tackles the inefficiency of tensor parallelism (TP) within 3D parallelism for large-scale training by proposing TP-free ZeroPP, which combines scalable pipeline parallelism with ZeRO-3 intra-node data parallelism and inter-node data parallelism. ZeroPP introduces ZeRO-compatible PP scheduling, memory reuse via scheduling units, and an activation recomputation strategy to reduce memory footprint while maintaining high utilization; it also offers two hybrid configurations (ZeRO-3 + PP + DP and ZeRO-3 + PP + ZeRO-1) to balance memory and communication. Theoretical analysis shows ZeroPP can reduce per-iteration communication to $\frac{36 B h^2}{U}$ and be advantageous when $2 s U b > 9 h$, while experiments on up to 64 GPUs report up to 33% faster throughput than conventional 3D parallelism with similar memory. Overall, ZeroPP provides a practical, scalable TP-free approach for large transformer models, reducing code complexity and communication overhead, with recomputation helping to further curb memory pressure.
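The two expressions quoted above can be evaluated directly. The following is a minimal sketch, not the authors' code: the summary does not define the symbols, so the readings given in the comments (sequence length, micro-batch size, hidden size, micro-batches per scheduling unit, micro-batches per iteration) are assumptions, and all function names are hypothetical.

```python
# Sketch of the two expressions quoted in the TL;DR.
# ASSUMED symbol meanings (not defined in the summary):
#   B: micro-batches per iteration      U: micro-batches per scheduling unit
#   s: sequence length                  b: micro-batch size
#   h: hidden size


def zeropp_comm_per_iteration(B: float, h: float, U: float) -> float:
    """Evaluate 36 * B * h^2 / U, in whatever unit the paper's formula uses."""
    return 36 * B * h**2 / U


def zeropp_is_advantageous(s: float, U: float, b: float, h: float) -> bool:
    """Evaluate the quoted condition 2 * s * U * b > 9 * h."""
    return 2 * s * U * b > 9 * h


def break_even_unit_size(s: float, b: float, h: float) -> float:
    """Solve 2 * s * U * b = 9 * h for U; the condition holds for larger U."""
    return 9 * h / (2 * s * b)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    s, b, h = 2048, 1, 4096
    for U in (4, 16):
        print(U, zeropp_is_advantageous(s, U, b, h))
    print(break_even_unit_size(s, b, h))
```

Under these illustrative values the condition fails for a small `U` and holds for a larger one, which matches the formula's $1/U$ dependence: the larger the value of $U$, the smaller the communication term.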

Abstract

Large-scale models rely heavily on 3D parallelism for distributed training, which utilizes tensor parallelism (TP) as the intra-operator parallelism to partition model states across GPUs. However, TP introduces significant communication overheads and complexity in modifying single-GPU code. In this paper, we propose a TP-free distributed framework ZeroPP, which leverages the hybrid of scalable inter-operator pipeline parallelism and intra-operator fully sharded data parallelism to train models at scale, reducing memory consumption and enabling high training efficiency. Through extensive experimentation, we demonstrate that ZeroPP achieves significant performance gains of up to 33% compared to conventional 3D parallelism while maintaining comparable GPU memory consumption.

Paper Structure

This paper contains 20 sections, 2 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Communication efficiency comparison between TP and ZeRO-3. We test the 6.2B GPT model parallelized with each method across 8 GPUs at varying global batch sizes and estimate the per-GPU communication volume. ZeRO-3 achieves better communication efficiency, and the gap widens as the global batch size increases.
  • Figure 2: Illustration of the proposed ZeRO-compatible PP schedules. The three PP schedules, shown from top to bottom, incorporate solutions for the communication, pipeline-bubble, and memory-usage problems that arise at larger batch sizes when PP is integrated with ZeRO. In particular, the micro-batches are divided into multiple groups, and the computation for each micro-batch group is treated as a scheduling unit.
  • Figure 3: Illustration of forward-pass communication when PP is combined with ZeRO. Digits 1 to 3 index micro-batches, and letters A to D index layers. $W_i$ denotes the parameter-gathering communication for the $i$-th layer.
  • Figure 4: Comparison of the iteration time and actual GPU memory consumption between ZeroPP and 3D parallelism under equivalent GPU memory constraints.
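
The Figure 2 caption describes dividing micro-batches into groups, each treated as one scheduling unit. A minimal sketch of that grouping follows. It assumes contiguous, equally sized groups with a possibly shorter last group; the caption does not say how groups are formed, and `group_micro_batches` is a hypothetical helper, not part of ZeroPP.

```python
from typing import List


def group_micro_batches(num_micro_batches: int, unit_size: int) -> List[List[int]]:
    """Split micro-batch indices 1..num_micro_batches into contiguous groups.

    ASSUMPTION: groups are contiguous and hold `unit_size` micro-batches each,
    except possibly the last. Indices start at 1, as in the Figure 3 caption.
    """
    if num_micro_batches < 0 or unit_size <= 0:
        raise ValueError("num_micro_batches must be >= 0 and unit_size > 0")
    indices = list(range(1, num_micro_batches + 1))
    return [indices[i:i + unit_size] for i in range(0, len(indices), unit_size)]


if __name__ == "__main__":
    # Each inner list is one scheduling unit.
    print(group_micro_batches(6, 2))
```

With this grouping, the number of scheduling units is the number of micro-batches divided by the unit size, rounded up.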