Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe
Mincong Huang, Chao Wang, Chi Ma, Yineng Zhang, Peng Zhang, Lei Yu
TL;DR
This work re-evaluates memory-balanced pipeline parallelism with BPipe in large Transformer training and finds that its benefits depend on the model and the attention kernel: GPT-3 shows gains, LLaMA does not, and the gains for GPT-3 become negligible once flash attention is used. It analyzes the root causes of this variation, notably attention kernel efficiency and memory-path optimizations, and introduces an MFU-based performance estimate to predict the gain from a larger micro-batch size. The study shows that BPipe's usefulness hinges on hardware, model, and kernel choices, and it provides a practical way to estimate potential speedups before large-scale deployment. Overall, the paper offers both empirical insights and a quantitative tool for deciding when memory-balancing strategies like BPipe pay off in practice.
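To make the idea of an MFU-based estimate concrete, here is a minimal sketch. It is not the paper's formula, which this summary does not state. It assumes a non-interleaved 1F1B schedule whose iteration takes (number of micro-batches + stages - 1) micro-batch slots, assumes the time per micro-batch scales as micro-batch size divided by per-stage MFU, and ignores any overhead from BPipe's activation transfers. The function names and all numeric values are hypothetical.

```python
def iteration_time(global_batch, micro_batch, num_stages, stage_mfu):
    """Relative iteration time (arbitrary units) under the assumptions above."""
    num_micro = global_batch // micro_batch
    # Work per micro-batch grows with its size; higher MFU finishes it faster.
    time_per_micro = micro_batch / stage_mfu
    # 1F1B: num_micro useful slots plus (num_stages - 1) bubble slots.
    return (num_micro + num_stages - 1) * time_per_micro


def estimated_speedup(global_batch, num_stages, b_old, mfu_old, b_new, mfu_new):
    """Ratio > 1 means the larger micro-batch is predicted to be faster."""
    t_old = iteration_time(global_batch, b_old, num_stages, mfu_old)
    t_new = iteration_time(global_batch, b_new, num_stages, mfu_new)
    return t_old / t_new


# Hypothetical MFU values, chosen only to illustrate both outcomes.
gain = estimated_speedup(128, 8, b_old=1, mfu_old=0.40, b_new=2, mfu_new=0.44)
loss = estimated_speedup(128, 8, b_old=1, mfu_old=0.40, b_new=2, mfu_new=0.41)
print(f"large MFU gain: {gain:.3f}, small MFU gain: {loss:.3f}")
```

Under these assumptions, doubling the micro-batch size also enlarges the pipeline bubble relative to useful work, so a small per-stage MFU improvement can be cancelled out. That is consistent with the observation that the benefit depends on how much the kernels gain from a larger micro-batch.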
Abstract
Pipeline parallelism is an essential technique for training large-scale Transformer models. However, it suffers from imbalanced memory consumption across pipeline stages, which leads to insufficient memory utilization. The BPipe technique was proposed to address this issue and has proven effective for GPT-3 training. Our experiments, however, did not show similar benefits for LLaMA training. In addition, BPipe yields only negligible benefits for GPT-3 training when flash attention is applied. We analyze the underlying causes of the divergent performance of BPipe on GPT-3 and LLaMA. Furthermore, we introduce a novel method to estimate the performance of BPipe.
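The memory imbalance mentioned above can be illustrated with a short sketch. It assumes a non-interleaved 1F1B schedule with at least as many micro-batches as stages, in which stage s (0-indexed) of p stages holds activations for up to p - s in-flight micro-batches. The balancing step is an idealized simplification of the pairing idea behind BPipe: stage s is paired with stage p - 1 - s and the pair's total is split evenly. It does not model the actual transfer schedule, and the function names are hypothetical.

```python
import math


def inflight_microbatches_1f1b(num_stages):
    """Peak number of micro-batches whose activations each stage holds in 1F1B."""
    return [num_stages - s for s in range(num_stages)]


def balanced_pairs(num_stages):
    """Idealized balancing: pair stage s with stage num_stages-1-s, split evenly."""
    peaks = inflight_microbatches_1f1b(num_stages)
    balanced = list(peaks)
    for s in range(num_stages // 2):
        partner = num_stages - 1 - s
        total = peaks[s] + peaks[partner]  # always num_stages + 1
        balanced[s] = math.ceil(total / 2)
        balanced[partner] = total - balanced[s]
    return balanced


print(inflight_microbatches_1f1b(8))  # [8, 7, 6, 5, 4, 3, 2, 1]
print(balanced_pairs(8))              # [5, 5, 5, 5, 4, 4, 4, 4]
```

In this idealized picture the peak drops from 8 to 5 micro-batches' worth of activations on the first stage, which is the headroom that could be spent on a larger micro-batch size.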
