2BP: 2-Stage Backpropagation

Christopher Rae, Joseph K. L. Lee, James Richings

TL;DR

This work tackles idle compute in pipeline-parallel training of large DNNs, which must be sharded across accelerators once they exceed single-device memory, by introducing 2-stage backpropagation (2BP). 2BP splits the backward pass into backward-p1 and backward-p2 and delays p2 to improve accelerator utilization. Implemented atop PyTorch without using autograd, 2BP can augment any pipeline schedule and is demonstrated across four model families, achieving up to a 1.70x throughput gain on a 7B-parameter transformer with 4 GPUs. The cost is higher peak memory usage, because intermediate derivatives must be stored and activations live longer; both the throughput gain and the memory increase vary by model and schedule. Overall, 2BP offers a practical path to accelerating multi-GPU training of very large models and motivates closer integration with framework-level differentiation control and memory-management strategies.

Abstract

As Deep Neural Networks (DNNs) grow in size and complexity, they often exceed the memory capacity of a single accelerator, necessitating the sharding of model parameters across multiple accelerators. Pipeline parallelism is a commonly used sharding strategy for training large DNNs. However, current implementations of pipeline parallelism are being unintentionally bottlenecked by the automatic differentiation tools provided by ML frameworks. This paper introduces 2-stage backpropagation (2BP). By splitting the backward propagation step into two separate stages, we can reduce idle compute time. We tested 2BP on various model architectures and pipelining schedules, achieving increases in throughput in all cases. Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods when training a LLaMa-like transformer with 7 billion parameters across 4 GPUs.
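
To make the two-stage split concrete, here is a minimal sketch for a single bias-free linear layer, written in plain Python so it runs without any framework. It assumes that backward-p1 computes the gradient with respect to the layer input (which the previous pipeline stage is waiting for) and that backward-p2 computes the gradient with respect to the weights (which is only needed before the optimizer step). The class name `Linear`, the method names, and the helpers `matmul` and `transpose` are hypothetical and are not taken from the authors' implementation.

```python
# Illustrative sketch only: names and structure are assumptions,
# not the paper's code. Matrices are lists of rows.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

class Linear:
    """y = x @ w, with the backward pass split into two stages."""

    def __init__(self, w):
        self.w = w
        self.saved_x = None       # activation kept alive until backward_p2
        self.saved_grad_y = None  # intermediate derivative kept until backward_p2
        self.grad_w = None

    def forward(self, x):
        self.saved_x = x
        return matmul(x, self.w)

    def backward_p1(self, grad_y):
        # Stage 1 (assumed critical path): gradient w.r.t. the input,
        # which would be sent to the previous pipeline stage.
        self.saved_grad_y = grad_y
        return matmul(grad_y, transpose(self.w))

    def backward_p2(self):
        # Stage 2 (assumed deferrable): gradient w.r.t. the weights.
        self.grad_w = matmul(transpose(self.saved_x), self.saved_grad_y)
        # Only now can the stored tensors be released.
        self.saved_x = None
        self.saved_grad_y = None
        return self.grad_w

layer = Linear([[1.0, 2.0], [3.0, 4.0]])
y = layer.forward([[1.0, 1.0]])
grad_x = layer.backward_p1([[1.0, 0.0]])  # can run immediately
grad_w = layer.backward_p2()              # can run later, in otherwise idle time
```

In this sketch, `saved_x` and `saved_grad_y` stay allocated between the two calls, which mirrors the memory cost the TL;DR attributes to stored intermediate derivatives and longer-lived activations.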

Paper Structure (14 sections, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Pipelining schedules (Naive, GPipe, 1F1B-1, 1F1B-2) with and without 2BP. This figure assumes that the forward, backward-p1, and backward-p2 passes each take the same amount of time.
  • Figure 2: Combining each microbatch's backward-p2 into a single operation.
  • Figure 3: Sample throughput for each model with different pipeline schedules. Light blue bars represent schedules without 2BP, and dark blue bars represent schedules with 2BP. Numbers above the bars give the throughput gain from using 2BP.
  • Figure 4: Maximum memory usage across the 4 GPUs. The "peak memory (GB)" is measured by obtaining the peak reserved memory on each GPU and then taking the maximum. Light blue bars represent schedules without 2BP, and dark blue bars represent schedules with 2BP. Numbers above the bars give the increase in memory from using 2BP.
  • Figure 5: Alternative memory-efficient schedule for 1F1B-2 with 2BP.
  • ...and 2 more figures
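
The idea in the Figure 2 caption, combining each microbatch's backward-p2 into a single operation, can be illustrated for a linear layer. The sketch below assumes that "combining" means concatenating the stored inputs and output gradients of all microbatches along the batch dimension and then doing one weight-gradient product; the caption itself does not spell this out. All names (`matmul`, `transpose`, `add`, `xs`, `gs`) are hypothetical, and the data is made up for illustration.

```python
# Illustrative sketch only: a linear layer y = x @ w is assumed,
# so the weight gradient for one microbatch is x^T @ grad_y.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Stored inputs and output gradients for two microbatches (toy values).
xs = [[[1.0, 2.0]], [[3.0, 4.0]]]
gs = [[[1.0, 0.0]], [[0.0, 1.0]]]

# Option A: one backward-p2 per microbatch, accumulating the results.
per_microbatch = None
for x, g in zip(xs, gs):
    gw = matmul(transpose(x), g)
    per_microbatch = gw if per_microbatch is None else add(per_microbatch, gw)

# Option B: concatenate along the batch dimension, then one product.
x_all = [row for x in xs for row in x]
g_all = [row for g in gs for row in g]
combined = matmul(transpose(x_all), g_all)
```

Both options produce the same accumulated weight gradient, because the sum of the per-microbatch products equals the product of the concatenated matrices. The second form replaces several small operations with one larger one.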