TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica

TL;DR

TeraPipe introduces token-level pipeline parallelism for training large-scale Transformer language models: it slices along the token dimension within a single input sequence. A dynamic programming-based algorithm computes the token slices that maximize pipeline throughput while accounting for forward/backward latency and hardware constraints, and the approach can be combined with existing data-parallel and model-parallel strategies. Empirical results on GPT-3-scale models show substantial speedups, particularly for the 175B-parameter model, with up to 6.75x improvement in certain configurations and consistent gains as sequence length increases. The work provides a practical, orthogonal approach to accelerating synchronous training without altering model accuracy, enabling more efficient training of next-generation LMs on large clusters.

Abstract

Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe
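
To make the dynamic-programming idea concrete, here is a minimal sketch, not the authors' implementation. It assumes the latency of a pipeline with K identical stages is the sum of the per-slice times plus (K - 1) times the slowest slice, fixes a candidate bound on the slowest slice, and then uses dynamic programming to minimize the sum of slice times under that bound. The function names, the `unit` granularity, and `toy_cost` (a made-up per-slice time that grows with the preceding context) are all hypothetical.

```python
import math
from typing import Callable, List, Tuple

# (slice_len, context_len) -> time one stage spends on that slice
Cost = Callable[[int, int], float]


def toy_cost(slice_len: int, context_len: int) -> float:
    """Hypothetical cost: fixed overhead + linear term + attention over the context."""
    return 1.0 + 0.01 * slice_len + 1e-5 * slice_len * (context_len + slice_len)


def pipeline_latency(slices: List[int], num_stages: int, cost: Cost) -> float:
    """Assumed model: sum of slice times + (num_stages - 1) * slowest slice."""
    times, context = [], 0
    for s in slices:
        times.append(cost(s, context))
        context += s
    return sum(times) + (num_stages - 1) * max(times)


def best_slicing(seq_len: int, num_stages: int, cost: Cost,
                 unit: int = 128) -> Tuple[float, List[int]]:
    """Search slicings whose slice lengths are multiples of `unit` tokens."""
    assert seq_len % unit == 0
    n = seq_len // unit
    # Every achievable slice time is a candidate bound for the slowest slice.
    candidates = sorted({cost(k * unit, c * unit)
                         for c in range(n) for k in range(1, n - c + 1)})
    best_latency, best_slices = math.inf, []
    for t_max in candidates:
        # total[i]: minimal summed time to cover the first i units,
        # using only slices whose time is at most t_max.
        total = [0.0] + [math.inf] * n
        prev = [0] * (n + 1)
        for i in range(1, n + 1):
            for k in range(1, i + 1):
                t = cost(k * unit, (i - k) * unit)
                if t <= t_max and total[i - k] + t < total[i]:
                    total[i], prev[i] = total[i - k] + t, i - k
        if math.isinf(total[n]):
            continue  # no slicing satisfies this bound
        latency = total[n] + (num_stages - 1) * t_max
        if latency < best_latency:
            slices, i = [], n
            while i > 0:  # walk back through the chosen split points
                slices.append((i - prev[i]) * unit)
                i = prev[i]
            best_latency, best_slices = latency, slices[::-1]
    return best_latency, best_slices


if __name__ == "__main__":
    latency, slices = best_slicing(2048, 8, toy_cost)
    print(latency, slices)
```

Under this toy cost model later tokens are more expensive, so the search can pick slice lengths that differ from a uniform split. With measured per-slice latencies in place of `toy_cost`, the same search structure applies unchanged.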

Paper Structure

This paper contains 15 sections, 8 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Different approaches to model-parallel training of Transformer-based LMs. (a) shows a standard multi-layer Transformer LM. In each layer, each position takes only its previous positions as input. (b) shows operation partitioning (Shoeybi et al., 2019). An allreduce operation is required to synchronize the results of each layer. (c) shows microbatch-based pipeline parallelism (Huang et al., 2019), which allows different microbatches (red and green bars) to be executed on different layers of the DNN in parallel. (d) shows TeraPipe (our work), which pipelines along the token dimension.
  • Figure 2: Execution timeline for different pipelining methods. Grey blocks indicate GPU idle time (a.k.a. pipeline bubbles). (a) Microbatch-based pipeline parallelism (e.g. GPipe). Each color corresponds to a microbatch. (b) Microbatch-based pipeline parallelism with a longer sequence (hence a smaller minibatch size due to fixed GPU memory). Pipeline bubbles increase significantly. (c) TeraPipe. Pipeline bubbles are substantially reduced because of the finer pipelining granularity.
  • Figure 3: Forward propagation time and throughput for a single layer of the GPT3-1B model with a single input sequence and different numbers of input tokens on a single NVIDIA V100 GPU, averaged over 30 independent runs. Top: Time per forward propagation. Bottom: Throughput measured in tokens per millisecond.
  • Figure 4: Execution timeline for inputs for uniform sequence split with non-uniform running time (top) and non-uniform sequence split with uniform running time (bottom). The total latency of a pipeline is determined by its slowest stage, and thus splits with non-uniform running time result in larger pipeline bubbles and inferior pipeline efficiency.
  • Figure 5: Training iteration latency for all configurations with and without TeraPipe. Details for each configuration are listed in Table \ref{tbl:setting}.
  • ...and 2 more figures
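
The point of the Figure 4 caption, that the slowest slice sets the pace of the whole pipeline, can be checked with a small schedule simulation. This is an illustrative sketch only: it assumes every stage takes the same time for a given slice and processes slices in order, and the slice times are made-up numbers, not measurements from the paper.

```python
from typing import List


def simulate_forward(slice_times: List[float], num_stages: int) -> float:
    """Finish time of the last slice on the last stage.

    A stage starts slice i once it has finished slice i - 1 and the
    previous stage has finished slice i.
    """
    finish_prev_stage = [0.0] * len(slice_times)
    for _ in range(num_stages):
        finish, stage_free = [], 0.0
        for i, t in enumerate(slice_times):
            start = max(stage_free, finish_prev_stage[i])
            stage_free = start + t
            finish.append(stage_free)
        finish_prev_stage = finish
    return finish_prev_stage[-1]


if __name__ == "__main__":
    # Same total work per stage (10 time units), split two ways.
    uneven = [1.0, 2.0, 3.0, 4.0]  # slice times differ
    even = [2.5, 2.5, 2.5, 2.5]    # slice times are balanced
    print(simulate_forward(uneven, 4))  # 22.0
    print(simulate_forward(even, 4))    # 17.5
```

Both schedules do the same amount of work per stage, yet the unbalanced one finishes later because every downstream stage waits on the 4-unit slice. Under these assumptions the simulated finish time matches the sum of slice times plus (stages - 1) times the slowest slice.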