TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Zhuohan Li; Siyuan Zhuang; Shiyuan Guo; Danyang Zhuo; Hao Zhang; Dawn Song; Ion Stoica

TL;DR

TeraPipe introduces token-level pipeline parallelism to train large-scale Transformer language models by slicing along the token dimension within a single input sequence. A dynamic programming-based algorithm computes optimal token slices to maximize pipeline throughput while accounting for forward/backward latency and hardware constraints, and it can be combined with existing data and model parallel strategies. Empirical results on GPT-3 scale models show substantial speedups, particularly for the 175B parameter model, with up to 6.75x improvement in certain configurations and consistent gains as sequence length increases. The work provides a practical, orthogonal approach to accelerate synchronous training without altering model accuracy, enabling more efficient training of next-generation LMs on large clusters.

Abstract

Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension that is orthogonal to existing model-parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models, thanks to their autoregressive property. This enables a more fine-grained pipeline than in previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances, compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe.
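
To make the dynamic-programming idea concrete, the following is a minimal sketch of how a token slicing could be chosen, not the authors' implementation. It assumes a simple pipeline-latency model in which the total time equals the sum of per-slice latencies plus (number of stages - 1) times the slowest slice. The cost function `slice_latency`, its constants, and the helper names `min_total_latency` and `best_slicing` are hypothetical and exist only for illustration; in practice the per-slice latency would be measured on the target hardware.

```python
from typing import Callable, List, Tuple

# Hypothetical cost model (an assumption, not measured data): a fixed per-slice
# overhead, a term linear in the slice size, and an attention term that grows
# with the number of earlier tokens the slice must attend to.
def slice_latency(num_tokens: int, context_len: int) -> float:
    overhead, per_token, attn = 0.5, 0.1, 0.01
    return overhead + per_token * num_tokens + attn * num_tokens * (context_len + num_tokens)

Latency = Callable[[int, int], float]

def min_total_latency(seq_len: int, t_max: float, latency: Latency) -> Tuple[float, List[int]]:
    """Minimise the summed slice latency when no slice may take longer than t_max."""
    inf = float("inf")
    best = [inf] * (seq_len + 1)   # best[i]: minimal summed latency for the first i tokens
    prev = [0] * (seq_len + 1)     # prev[i]: start index of the last slice in that optimum
    best[0] = 0.0
    for end in range(1, seq_len + 1):
        for start in range(end):
            cost = latency(end - start, start)
            if cost <= t_max and best[start] + cost < best[end]:
                best[end] = best[start] + cost
                prev[end] = start
    if best[seq_len] == inf:
        return inf, []             # no slicing satisfies the t_max bound
    slices, end = [], seq_len
    while end > 0:                 # walk back through prev[] to recover slice sizes
        slices.append(end - prev[end])
        end = prev[end]
    slices.reverse()
    return best[seq_len], slices

def best_slicing(seq_len: int, num_stages: int, latency: Latency) -> Tuple[float, List[int]]:
    """Try every achievable slice latency as the bound t_max and keep the best plan."""
    candidates = sorted({latency(end - start, start)
                         for end in range(1, seq_len + 1) for start in range(end)})
    best_total, best_slices = float("inf"), []
    for t_max in candidates:
        total, slices = min_total_latency(seq_len, t_max, latency)
        if not slices:
            continue
        pipeline_latency = total + (num_stages - 1) * t_max
        if pipeline_latency < best_total:
            best_total, best_slices = pipeline_latency, slices
    return best_total, best_slices

if __name__ == "__main__":
    total, slices = best_slicing(seq_len=8, num_stages=4, latency=slice_latency)
    print("slice sizes:", slices, "estimated latency:", round(total, 4))
```

Under this assumed cost model, later slices see a longer context and therefore cost more per token, so a good plan need not be uniform. The exhaustive search over `t_max` candidates is quartic in the number of token positions, so a practical version would restrict slice boundaries to a coarser granularity.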
