Table of Contents
Fetching ...

Memory-Efficient Pipeline-Parallel DNN Training

Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, Matei Zaharia

TL;DR

PipeDream-2BW tackles memory bottlenecks in pipeline-parallel DNN training for billion-parameter models by introducing double-buffered weight updates (2BW) and a memory-conscious planner that partitions models across accelerators. The 2BW scheme maintains two weight versions, allowing new weights to be used for new inputs while a shadow version handles in-flight computations, yielding high throughput with a convergence profile close to vanilla updates. A variant, PipeDream-Flush, reduces memory further at the cost of throughput, while equi-replicated parallel pipelines balance compute and communication across servers. Activation recomputation further lowers memory usage, enabling larger microbatches; the planner combines per-block timing and memory models to select optimal $(w,d)$ configurations under hardware constraints. Empirically, 2BW achieves up to 20× speedups over non-pipelined baselines and up to 3.2× over GPipe, enabling training of GPT/BERT-scale models up to ~30B parameters on 64 GPUs, with convergence and finetuning accuracy comparable to standard methods. This approach offers a practical path to scalable, memory-conscious training of massive transformer models on commodity hardware.

Abstract

Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models. However, parameters and activations for such large models often do not fit in the memory of a single accelerator device; this means that it is necessary to distribute training of large models over multiple accelerators. In this work, we propose PipeDream-2BW, a system that supports memory-efficient pipeline parallelism. PipeDream-2BW uses a novel pipelining and weight gradient coalescing strategy, combined with the double buffering of weights, to ensure high throughput, low memory footprint, and weight update semantics similar to data parallelism. In addition, PipeDream-2BW automatically partitions the model over the available hardware resources, while respecting hardware constraints such as memory capacities of accelerators and interconnect topologies. PipeDream-2BW can accelerate the training of large GPT and BERT language models by up to 20$\times$ with similar final model accuracy.

Memory-Efficient Pipeline-Parallel DNN Training

TL;DR

PipeDream-2BW tackles memory bottlenecks in pipeline-parallel DNN training for billion-parameter models by introducing double-buffered weight updates (2BW) and a memory-conscious planner that partitions models across accelerators. The 2BW scheme maintains two weight versions, allowing new weights to be used for new inputs while a shadow version handles in-flight computations, yielding high throughput with a convergence profile close to vanilla updates. A variant, PipeDream-Flush, reduces memory further at the cost of throughput, while equi-replicated parallel pipelines balance compute and communication across servers. Activation recomputation further lowers memory usage, enabling larger microbatches; the planner combines per-block timing and memory models to select optimal configurations under hardware constraints. Empirically, 2BW achieves up to 20× speedups over non-pipelined baselines and up to 3.2× over GPipe, enabling training of GPT/BERT-scale models up to ~30B parameters on 64 GPUs, with convergence and finetuning accuracy comparable to standard methods. This approach offers a practical path to scalable, memory-conscious training of massive transformer models on commodity hardware.

Abstract

Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models. However, parameters and activations for such large models often do not fit in the memory of a single accelerator device; this means that it is necessary to distribute training of large models over multiple accelerators. In this work, we propose PipeDream-2BW, a system that supports memory-efficient pipeline parallelism. PipeDream-2BW uses a novel pipelining and weight gradient coalescing strategy, combined with the double buffering of weights, to ensure high throughput, low memory footprint, and weight update semantics similar to data parallelism. In addition, PipeDream-2BW automatically partitions the model over the available hardware resources, while respecting hardware constraints such as memory capacities of accelerators and interconnect topologies. PipeDream-2BW can accelerate the training of large GPT and BERT language models by up to 20 with similar final model accuracy.

Paper Structure

This paper contains 25 sections, 8 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: Timelines of different pipeline-parallel executions. Without loss of generality, forward and backward passes are assumed to take twice as long as forward passes; forward passes are shown in blue and backward passes are shown in green. Numbers indicate microbatch ID, time is shown along $x$-axis, per-worker utilization is shown along the $y$-axis. GPipe maintains a single weight version, but periodically flushes the pipeline. PipeDream does not introduce periodic pipeline flushes, but maintains multiple weight versions.
  • Figure 2: Timeline showing PipeDream-2BW's double-buffered weight update (2BW) scheme with time along $x$-axis. Without loss of generality, backward passes are assumed to take twice as long as forward passes. PipeDream-2BW only stashes two weight versions at every worker, reducing the total memory footprint while no longer requiring expensive pipeline stalls. $W^{(v)}_i$ indicates weights on worker $i$ with version $v$ (contains weight gradient generated from input $v$). New weight versions are generated in checkered green boxes; $W_4^{(4)}$ is first used for input 9's forward pass.
  • Figure 3: Timelines of GPipe and PipeDream-Flush for 2 stages. Both GPipe and PipeDream-Flush use pipeline flushes; PipeDream-Flush alternates between forward and backward passes in steady state to keeping memory footprint low compared to GPipe by limiting activation stashes to only in-flight microbatches.
  • Figure 4: Example PipeDream-2BW ($2, 3$) configuration. The model is partitioned into 3 stages ($d=3$) and each pipeline is replicated twice ($w=2$). Each pipeline replica is shown in a different color.
  • Figure 5: Training and validation loss when pre-training BERT and GPT models with vanilla Adam and Adam with 2BW.
  • ...and 7 more figures