Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe

Mincong Huang, Chao Wang, Chi Ma, Yineng Zhang, Peng Zhang, Lei Yu

TL;DR

This work re-evaluates BPipe, a memory-balanced pipeline parallelism technique, for training large Transformer models. Its benefit depends on the model and the attention kernel: GPT-3 training shows gains, LLaMA training does not see similar gains and may degrade, and the gain for GPT-3 becomes negligible once flash attention is applied. The paper analyzes the root causes of this variation, notably attention kernel efficiency and memory-path optimizations, and introduces an estimation method based on MFU (model FLOPs utilization) to predict the gain from a larger micro-batch size. Because BPipe's usefulness hinges on hardware, model, and kernel choices, the method offers a practical way to estimate potential speedups before large-scale deployment and to decide when memory-balancing strategies like BPipe are worthwhile.

Abstract

Pipeline parallelism is an essential technique in the training of large-scale Transformer models. However, it suffers from imbalanced memory consumption, leading to insufficient memory utilization. The BPipe technique was proposed to address this issue and has proven effective in the GPT-3 model. Nevertheless, our experiments have not yielded similar benefits for LLaMA training. Additionally, BPipe only yields negligible benefits for GPT-3 training when applying flash attention. We analyze the underlying causes of the divergent performance of BPipe on GPT-3 and LLaMA. Furthermore, we introduce a novel method to estimate the performance of BPipe.
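
The memory imbalance mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper: it assumes the standard 1F1B schedule, in which stage x (0-indexed) of a p-way pipeline holds activations for at most p - x micro-batches, and it assumes that balancing pairs stage x with stage p-1-x and splits the pair's combined load evenly. The function names `in_flight_1f1b` and `evenly_paired` are hypothetical, and the even split is an idealization rather than BPipe's exact bound.

```python
from math import ceil


def in_flight_1f1b(p: int, m: int) -> list[int]:
    """Peak number of micro-batches whose activations each stage holds.

    Assumes plain 1F1B: stage x runs p - x - 1 warm-up forwards plus one
    steady-state forward before its first backward, capped by m micro-batches.
    """
    return [min(p - x, m) for x in range(p)]


def evenly_paired(counts: list[int]) -> list[int]:
    """Idealized balancing (assumption): stage x and stage p-1-x share load."""
    p = len(counts)
    balanced = []
    for x in range(p):
        partner = p - 1 - x
        total = counts[x] + counts[partner]
        # When the combined load is odd, the earlier stage keeps the extra one.
        balanced.append(ceil(total / 2) if x < partner else total // 2)
    return balanced


if __name__ == "__main__":
    counts = in_flight_1f1b(p=4, m=8)
    print("1F1B per-stage load:  ", counts)
    print("evenly paired load:   ", evenly_paired(counts))
```

Under these assumptions, a 4-way pipeline goes from a peak of 4 in-flight micro-batches on the first stage to a peak of 3, which is the kind of headroom that could allow a larger micro-batch size.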

Paper Structure (9 sections, 4 equations, 2 figures, 5 tables)

This paper contains 9 sections, 4 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: An illustration of BPipe within a 4-way 1F1B pipeline strategy
  • Figure 2: An illustration of pair-adjacent assignment for 16-way pipeline parallelism on two nodes, each with 8 GPUs