Synergistic Tensor and Pipeline Parallelism

Mengshi Qi, Jiaxuan Peng, Jie Zhang, Juan Zhu, Yong Li, Huadong Ma

TL;DR

This work tackles the dual bottlenecks of tensor parallelism (TP) and pipeline parallelism (PP) in distributed training of very large models. It introduces braided execution blocks that decouple the forward and backward PP computations into fine-grained units and interleave them so that TP communication overlaps with computation. Building on these blocks, it proposes a synergistic PP schedule with a V-shaped dataflow that balances memory across stages and reduces both TP and PP bubbles, along with an enhanced variant that offloads activations to relieve memory pressure. The approach achieves consistent throughput gains, up to 12% for LLMs and 16% for multimodal LLMs over existing scheduling methods, at the cost of higher peak memory, which the offloading variant mitigates. The results indicate that the schedule is practical on large-scale hardware and offers a framework for memory-aware, high-throughput hybrid parallelism in future large-model training.

Abstract

In machine learning systems, hybrid model parallelism that combines tensor parallelism (TP) and pipeline parallelism (PP) has become the dominant solution for distributed training of Large Language Models (LLMs) and Multimodal LLMs (MLLMs). However, TP introduces significant collective communication overhead, while PP suffers from synchronization inefficiencies such as pipeline bubbles. Existing works primarily address these challenges in isolation, focusing either on overlapping TP communication or on flexible PP scheduling to mitigate pipeline bubbles. In this paper, we propose a new synergistic tensor and pipeline parallelism schedule that simultaneously reduces both types of bubbles. Our proposed schedule decouples the forward and backward passes in PP into fine-grained computation units, which are then braided to form a composite computation sequence. This compositional structure enables near-complete elimination of TP-related bubbles. Building upon this structure, we further design the PP schedule to minimize PP bubbles. Experimental results demonstrate that our approach improves training throughput by up to 12% for LLMs and 16% for MLLMs compared to existing scheduling methods. Our source code is available at https://github.com/MICLAB-BUPT/STP.
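To give a sense of scale for the pipeline bubbles mentioned above, the following minimal sketch computes the idle fraction of a plain synchronous pipeline schedule (GPipe or 1F1B style). It is a textbook baseline estimate, not the schedule proposed in this paper, and it assumes every stage takes the same time per microbatch and that communication cost is negligible. The function name `pipeline_bubble_fraction` is hypothetical.

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a synchronous pipeline with equal-cost stages.

    Assumption (not from the paper): each of the `num_stages` devices spends
    (num_stages - 1) time slots idle during warm-up and drain, out of
    (num_microbatches + num_stages - 1) slots in total.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be positive")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# Illustrative setting: 4 pipeline stages and 12 microbatches.
fraction = pipeline_bubble_fraction(4, 12)
print(f"bubble fraction: {fraction:.2f}")
```

Under these assumptions, increasing the number of microbatches shrinks the bubble fraction, but it does nothing for the TP communication stalls inside each stage, which is the gap the braided schedule targets.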

Paper Structure

This paper contains 27 sections, 2 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Speedup from overlapping TP communication with computation in PP within a Transformer layer of Qwen2 (Yang et al., 2024) in the forward pass. The proportion of time spent on TP communication grows significantly as the TP size increases; our schedule overlaps this communication effectively, unlike the naive implementation.
  • Figure 2: Illustration of the main computation and communication operations in a single TP Transformer layer. $f$ is an identity operation in the forward pass and an All-Reduce in the backward pass. $g$ is the opposite of $f$: an All-Reduce in the forward pass and an identity operation in the backward pass.
  • Figure 3: Two types of execution blocks that braid the forward (F) and backward (B) computation units to overlap the TP communications (All-Reduce), denoted as AR.
  • Figure 4: Comparison between the parallel and "V"-shaped pipeline flows for microbatch 1. The "V"-shaped flow shows improved memory balance across stages, which is attributed to the early backward pass on device 0.
  • Figure 5: Synergistic tensor and pipeline parallel schedule with the setting of 4 devices and 12 microbatches. The dark and light pieces indicate the computation of model chunks 0 and 1, respectively. F, B, and W represent the forward, backward, and backward for weight gradients, respectively.
  • ...and 8 more figures
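
The conjugate operators $f$ and $g$ described in the Figure 2 caption can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the paper's implementation: it assumes the common column-split then row-split layout for a two-layer MLP, uses ReLU as a stand-in activation, simulates All-Reduce as an element-wise sum over per-rank matrices, and all helper names (`matmul`, `all_reduce`, `relu_backward`, and so on) are hypothetical.

```python
import random

def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def relu(m):
    return [[max(v, 0.0) for v in row] for row in m]

def relu_backward(grad, act):
    """Pass the gradient only where the ReLU output was positive."""
    return [[g if a > 0 else 0.0 for g, a in zip(gr, ar)] for gr, ar in zip(grad, act)]

def all_reduce(partials):
    """Simulated All-Reduce (sum): the matrix every rank would hold afterwards."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
TP = 2                      # hypothetical tensor-parallel size
x = rand_matrix(3, 4)       # input, replicated on every rank
w1 = rand_matrix(4, 8)      # first linear layer, split by columns
w2 = rand_matrix(8, 4)      # second linear layer, split by rows
k = 8 // TP
w1_shards = [[row[r * k:(r + 1) * k] for row in w1] for r in range(TP)]
w2_shards = [w2[r * k:(r + 1) * k] for r in range(TP)]

# Forward: f is the identity (x is used as-is), g is an All-Reduce
# that sums the per-rank partial outputs.
hidden = [relu(matmul(x, a)) for a in w1_shards]
y_tp = all_reduce([matmul(h, b) for h, b in zip(hidden, w2_shards)])

# Backward w.r.t. the input: g is the identity (grad_y is replicated),
# f is an All-Reduce that sums the per-rank partial input gradients.
grad_y = [[1.0] * 4 for _ in range(3)]
grad_x_tp = all_reduce([
    matmul(relu_backward(matmul(grad_y, transpose(b)), h), transpose(a))
    for h, a, b in zip(hidden, w1_shards, w2_shards)
])

# Single-device reference for comparison.
h_ref = relu(matmul(x, w1))
y_ref = matmul(h_ref, w2)
grad_x_ref = matmul(relu_backward(matmul(grad_y, transpose(w2)), h_ref), transpose(w1))

print("forward max diff: ", max_abs_diff(y_tp, y_ref))
print("backward max diff:", max_abs_diff(grad_x_tp, grad_x_ref))
```

Both differences should be at floating-point rounding level, which shows why one All-Reduce per direction is enough for this layout. In a real TP run these All-Reduce calls are the communication that stalls computation unless it is overlapped with other work.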