Memory-Efficient Pipeline-Parallel DNN Training
Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, Matei Zaharia
TL;DR
PipeDream-2BW addresses the memory bottleneck of pipeline-parallel DNN training for billion-parameter models with two components: double-buffered weight updates (2BW) and a memory-aware planner that partitions the model across accelerators. 2BW keeps two weight versions: new inputs use the latest version, while a shadow version serves computations that are still in flight. This gives high throughput with weight update semantics and convergence close to those of data-parallel training. A variant, PipeDream-Flush, lowers memory further at the cost of throughput. Equi-replicated parallel pipelines, in which every stage is replicated the same number of times, balance compute and communication across servers. Activation recomputation reduces memory usage further and permits larger microbatches. The planner combines per-block timing and memory models to select the best $(w,d)$ configuration (pipeline width $w$, depth $d$) that satisfies the hardware constraints. Empirically, PipeDream-2BW is up to 20× faster than non-pipelined baselines and up to 3.2× faster than GPipe, and it trains GPT- and BERT-style models with up to roughly 30B parameters on 64 GPUs, with convergence and fine-tuning accuracy comparable to standard methods. The approach offers a practical path to scalable, memory-conscious training of very large transformer models on commodity hardware.
Abstract
Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models. However, the parameters and activations of such large models often do not fit in the memory of a single accelerator, so training must be distributed over multiple accelerators. In this work, we propose PipeDream-2BW, a system that supports memory-efficient pipeline parallelism. PipeDream-2BW uses a novel pipelining and weight gradient coalescing strategy, combined with the double buffering of weights, to ensure high throughput, low memory footprint, and weight update semantics similar to data parallelism. In addition, PipeDream-2BW automatically partitions the model over the available hardware resources, while respecting hardware constraints such as memory capacities of accelerators and interconnect topologies. PipeDream-2BW can accelerate the training of large GPT and BERT language models by up to 20$\times$ with similar final model accuracy.
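To make the double-buffering idea concrete, the following is a minimal Python sketch, not the PipeDream-2BW implementation. It assumes plain Python lists as weights, plain SGD with gradients averaged over the coalesced microbatches, and a hypothetical class name `DoubleBufferedWeights`; all method names and the parameter `microbatches_per_update` are illustrative. It also assumes that by the time a new version is published, no in-flight microbatch still needs the version two steps back, so at most two versions are ever stored.

```python
class DoubleBufferedWeights:
    """Illustrative double-buffered weight store (hypothetical, simplified)."""

    def __init__(self, weights, lr, microbatches_per_update):
        self.versions = {0: list(weights)}   # version id -> weights
        self.latest = 0                      # id of the newest version
        self.lr = lr
        self.m = microbatches_per_update     # gradients coalesced per update
        self._grad_sum = [0.0] * len(weights)
        self._pending = 0

    def begin_forward(self):
        """New inputs always start with the latest weight version."""
        return self.latest, self.versions[self.latest]

    def weights_for_backward(self, version):
        """In-flight microbatches reuse the version they started with."""
        return self.versions[version]

    def accumulate(self, grad):
        """Coalesce one microbatch gradient; publish after m of them."""
        self._grad_sum = [s + g for s, g in zip(self._grad_sum, grad)]
        self._pending += 1
        if self._pending == self.m:
            self._publish()

    def _publish(self):
        # Assumption: SGD step with the averaged coalesced gradient,
        # applied to the latest weights.
        current = self.versions[self.latest]
        new = [w - self.lr * s / self.m for w, s in zip(current, self._grad_sum)]
        self.latest += 1
        self.versions[self.latest] = new
        # Keep only the latest version and its predecessor (the shadow).
        self.versions.pop(self.latest - 2, None)
        self._grad_sum = [0.0] * len(new)
        self._pending = 0


buf = DoubleBufferedWeights([1.0], lr=0.1, microbatches_per_update=2)
version, w_fwd = buf.begin_forward()       # microbatch starts on version 0
buf.accumulate([2.0])
buf.accumulate([4.0])                      # second gradient triggers an update
w_bwd = buf.weights_for_backward(version)  # still sees the version-0 weights
```

In this sketch the microbatch that began on version 0 can still finish its backward pass against the same weights after version 1 has been published, while any input admitted afterwards picks up version 1. Discarding the version two steps back is what bounds the weight memory at two copies.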
