Table of Contents
Fetching ...

Pipeline Parallelism with Controllable Memory

Penghui Qi, Xinyi Wan, Nyamdavaa Amar, Min Lin

TL;DR

This paper proposes a framework to decompose pipeline schedules as repeating a building block, and shows that the lifespan of the building block decides the peak activation memory of the pipeline schedule.

Abstract

Pipeline parallelism has been widely explored, but most existing schedules lack a systematic methodology. In this paper, we propose a framework to decompose pipeline schedules as repeating a building block, and show that the lifespan of the building block decides the peak activation memory of the pipeline schedule. Guided by the observations, we find that almost all existing pipeline schedules, to the best of our knowledge, are memory inefficient. To address this, we introduce a family of memory efficient building blocks with controllable activation memory, which can reduce the peak activation memory to 1/2 of 1F1B without sacrificing efficiency, and even to 1/3 with comparable throughput. We can also achieve almost zero pipeline bubbles while maintaining the same activation memory as 1F1B. Our evaluations demonstrate that in pure pipeline parallelism settings, our methods outperform 1F1B by from 7% to 55% in terms of throughput. When employing a grid search over hybrid parallelism hyperparameters in practical scenarios, our methods demonstrate a 16% throughput improvement over the 1F1B baseline for large language models. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.

Pipeline Parallelism with Controllable Memory

TL;DR

This paper proposes a framework to decompose pipeline schedules as repeating a building block, and shows that the lifespan of the building block decides the peak activation memory of the pipeline schedule.

Abstract

Pipeline parallelism has been widely explored, but most existing schedules lack a systematic methodology. In this paper, we propose a framework to decompose pipeline schedules as repeating a building block, and show that the lifespan of the building block decides the peak activation memory of the pipeline schedule. Guided by the observations, we find that almost all existing pipeline schedules, to the best of our knowledge, are memory inefficient. To address this, we introduce a family of memory efficient building blocks with controllable activation memory, which can reduce the peak activation memory to 1/2 of 1F1B without sacrificing efficiency, and even to 1/3 with comparable throughput. We can also achieve almost zero pipeline bubbles while maintaining the same activation memory as 1F1B. Our evaluations demonstrate that in pure pipeline parallelism settings, our methods outperform 1F1B by from 7% to 55% in terms of throughput. When employing a grid search over hybrid parallelism hyperparameters in practical scenarios, our methods demonstrate a 16% throughput improvement over the 1F1B baseline for large language models. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
Paper Structure (35 sections, 4 equations, 18 figures, 10 tables)

This paper contains 35 sections, 4 equations, 18 figures, 10 tables.

Figures (18)

  • Figure 1: A pipeline can be built by repeating a building block, and then squeezing redundant bubbles.
  • Figure 2: The V-Shape building block ensures balanced peak memory across all devices, whereas the parallel building block has a memory bottleneck in the first device.
  • Figure 3: V-Shape building blocks with 4 devices ($d=4$), where white text colors represent the first half of model stages and black text colors represent the second half. F, B and W represent the forward, backward (for activation gradients) and backward for weight gradients, respectively.
  • Figure 4: V-Shape schedules compared to 1F1B, under the setting of 4 devices and 8 microbatches. The stable phases adhere to the pattern of their building blocks.
  • Figure 5: We take a repeating $d \times T$ grid from V-Min and V-Half schedules, and assign F/B/W with different values. The result shows V-Min has bubbles for every repeating grid, while V-Half does not.
  • ...and 13 more figures