2BP: 2-Stage Backpropagation
Christopher Rae, Joseph K. L. Lee, James Richings
TL;DR
Large DNNs that exceed the memory of a single accelerator are commonly trained with pipeline parallelism, which leaves accelerators idle for part of each step. This work reduces that idle compute with 2-stage backpropagation (2BP), which splits the backward pass into two stages, backward-p1 and backward-p2, and delays backward-p2 to improve accelerator utilization. Implemented on top of PyTorch without using autograd, 2BP can augment any pipeline schedule and is demonstrated on four model families, achieving up to a 1.70x throughput gain on a 7-billion-parameter transformer across 4 GPUs. The trade-off is higher peak memory usage, because intermediate derivatives must be stored and activations are kept alive longer; the size of the throughput gain varies by model and schedule. Overall, 2BP offers a practical path to accelerating multi-GPU training of very large models and motivates closer integration with framework-level differentiation control and memory-management strategies.
Abstract
As Deep Neural Networks (DNNs) grow in size and complexity, they often exceed the memory capacity of a single accelerator, necessitating the sharding of model parameters across multiple accelerators. Pipeline parallelism is a commonly used sharding strategy for training large DNNs. However, current implementations of pipeline parallelism are unintentionally bottlenecked by the automatic differentiation tools provided by ML frameworks. This paper introduces 2-stage backpropagation (2BP). By splitting the backward propagation step into two separate stages, we can reduce idle compute time. We tested 2BP on various model architectures and pipelining schedules, achieving increases in throughput in all cases. Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods when training a LLaMa-like transformer with 7 billion parameters across 4 GPUs.
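To make the split concrete, the following is a minimal NumPy sketch of a two-stage backward pass for a bias-free linear layer. It is not the authors' PyTorch implementation. It assumes that backward-p1 computes the gradient with respect to the layer input (the value the previous pipeline stage is waiting for) and that backward-p2 computes the gradient with respect to the weights, which can be deferred. The class `Linear`, its methods `backward_p1` and `backward_p2`, and the toy loss `y.sum()` are hypothetical names chosen for illustration.

```python
import numpy as np


class Linear:
    """Bias-free linear layer y = x @ W with a manually split backward pass."""

    def __init__(self, in_features, out_features, rng):
        self.W = rng.standard_normal((in_features, out_features))
        self.grad_W = np.zeros_like(self.W)
        self._saved = []  # (input, output-gradient) pairs awaiting backward_p2

    def forward(self, x):
        self._x = x  # activation needed later for the weight gradient
        return x @ self.W

    def backward_p1(self, dy):
        # Stage 1 (assumed): gradient w.r.t. the input only. This is what the
        # previous layer or pipeline stage needs to continue its own backward.
        self._saved.append((self._x, dy))  # kept alive until stage 2
        return dy @ self.W.T

    def backward_p2(self):
        # Stage 2 (assumed): weight gradients for all pending micro-batches,
        # run later, for example when the device would otherwise be idle.
        for x, dy in self._saved:
            self.grad_W += x.T @ dy
        self._saved.clear()


rng = np.random.default_rng(0)
layer1 = Linear(4, 3, rng)
layer2 = Linear(3, 2, rng)
micro_batches = [rng.standard_normal((5, 4)) for _ in range(3)]

# Reference gradients computed the conventional, fused way.
ref_grad_W1 = np.zeros_like(layer1.W)
ref_grad_W2 = np.zeros_like(layer2.W)

for x in micro_batches:
    h = layer1.forward(x)
    y = layer2.forward(h)
    dy = np.ones_like(y)          # gradient of the toy loss y.sum()
    dh = layer2.backward_p1(dy)   # only input gradients flow backward here
    layer1.backward_p1(dh)
    ref_grad_W2 += h.T @ dy
    ref_grad_W1 += x.T @ dh

pending = len(layer1._saved)      # micro-batches still holding x and dy

# Deferred weight-gradient computation for both layers.
layer2.backward_p2()
layer1.backward_p2()

grads_match = (np.allclose(layer1.grad_W, ref_grad_W1)
               and np.allclose(layer2.grad_W, ref_grad_W2))
```

In this sketch the deferred stage produces the same weight gradients as the fused computation, while `_saved` shows where the extra memory goes: each layer holds its input activation and output gradient for every micro-batch until `backward_p2` runs.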
