Improving Automatic Parallel Training via Balanced Memory Workload Optimization

Yujie Wang; Youhe Jiang; Xupeng Miao; Fangcheng Fu; Shenhan Zhu; Xiaonan Nie; Yaofeng Tu; Bin Cui

TL;DR

This work tackles the challenge of efficiently training Transformer models across multiple GPUs by automatically exploring a five-dimensional parallelism space: data parallelism (DP), sharded data parallelism (SDP), tensor parallelism (TP), pipeline parallelism (PP), and activation checkpointing (CKPT). It introduces Galvatron-BMW, which uses decision-tree decomposition to prune the search space and a dynamic programming search to identify the best hybrid plan, together with a bi-objective workload-balancing framework that improves hardware utilization. The system estimates compute, communication, and memory costs, accounting for compute-communication overlap and CKPT recomputation, and is implemented on top of PyTorch with NCCL. Experiments on NLP and CV models show substantial throughput improvements over state-of-the-art baselines under diverse hardware budgets, with gains of up to 530% over pure parallelism strategies and strong gains over prior automatic hybrid approaches. The practical result is a scalable, user-friendly automatic parallel training workflow that can handle large Transformer models with heterogeneous memory footprints and interconnects.

Abstract

Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual effort to design distributed training plans or limit parallelism combinations to a constrained search space. In this paper, we present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy. To effectively navigate this vast search space, we employ a decision-tree approach for decomposition and pruning based on intuitive insights. We further utilize a dynamic programming search algorithm to derive the optimal plan. Moreover, to improve resource utilization and enhance system efficiency, we propose a bi-objective optimization workflow that focuses on workload balance. Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints. Across all tested scenarios, Galvatron-BMW consistently achieves superior system throughput, surpassing previous approaches that rely on limited parallelism strategies.
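To make the idea of a dynamic programming search under a GPU memory constraint concrete, here is a minimal sketch. It is not the paper's algorithm: it assumes each layer independently picks one strategy with a fixed time and memory cost, ignores any cost of switching strategies between layers, and uses invented toy numbers. The function name `search_plan` and the option labels are hypothetical.

```python
def search_plan(layer_options, budget):
    """Pick one strategy per layer to minimize total time within a memory budget.

    layer_options[i] is a list of (name, time, memory) tuples for layer i,
    with memory given in integer units. Returns (total_time, plan) or None
    if no combination fits in the budget.
    """
    # best[m] = (time, plan) for the layers processed so far using exactly m memory
    best = {0: (0.0, [])}
    for options in layer_options:
        nxt = {}
        for used, (time_so_far, plan) in best.items():
            for name, dt, dm in options:
                m = used + dm
                if m > budget:
                    continue  # this choice would exceed the memory budget
                cand = (time_so_far + dt, plan + [name])
                if m not in nxt or cand[0] < nxt[m][0]:
                    nxt[m] = cand
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda entry: entry[0])


# Toy costs (assumed, not measured): faster strategies use more memory.
options = [("DP", 1.0, 4), ("TP", 1.5, 2), ("SDP+CKPT", 2.0, 1)]
result = search_plan([options] * 3, budget=8)
print(result)
```

With these toy costs, three layers of DP would need 12 memory units, so the search falls back to a mix of one DP layer and two TP layers, which is the fastest combination that fits in 8 units.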

Paper Structure (35 sections, 9 equations, 9 figures, 6 tables, 3 algorithms)

Introduction
Preliminary
Transformer Models
Parallelism in Distributed Training
Search Space Construction
Search Space Analysis
Overhead Analysis
Two-GPU Example
Multi-GPU Extension
Decision-tree-based Search Space Decomposition
Parallelism Optimization Framework
Basic Parallelism Optimization
Basic Optimization Workflow
Dynamic Programming Search
Complexity Analysis
...and 20 more sections

Figures (9)

Figure 1: System overview of Galvatron-BMW.
Figure 2: Illustration of different basic parallelisms in Galvatron-BMW. We use the green and gray colors to denote the input and output activations for both forward and backward computation. The model parameters and gradients are in blue.
Figure 3: Illustration of the decision trees for 8 GPUs under different PP degrees (i.e., 8/4/2/1). We select one of them to show how a tree describes the candidate hybrid parallelism strategies. We remove $S_1$ and $S_2$ as suggested by Takeaway #3 and illustrate the remaining hybrid strategies on the right. Each tree can be used either with CKPT ($S_3^{'}$-$S_6^{'}$) or without CKPT ($S_3$-$S_6$). In total, there are 44 candidate hybrid strategies across all trees.
Figure 4: Performance of 4-way 1F1B-Flush pipelines with different partition plans on A100 GPUs. The global batch size is 32 for BERT-Huge-48 and 64 for T5-512/4-48 (see the experimental setup section for model details), and the number of micro-batches is 8. Bars (from left to right) represent pipeline stages 1 through 4: bar height shows memory consumption and bar width shows normalized time cost. The figure also reports the number of layers, balance degrees, and throughput.
Figure 5: Different memory budgets
...and 4 more figures
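
One way to arrive at the 44 candidates quoted in the Figure 3 caption is sketched below. The construction rules are assumptions made for illustration and are not spelled out in the caption: PP degrees are powers of two; the GPUs left in each pipeline stage are split among an ordered sequence of distinct types drawn from DP, SDP, and TP, each with a power-of-two degree of at least 2; DP and SDP are never mixed (one reading of the removal of $S_1$ and $S_2$); and every resulting strategy is counted once with CKPT and once without. All function names are hypothetical.

```python
from itertools import permutations

TYPES = ("DP", "SDP", "TP")


def factorizations(n, parts):
    """Ordered ways to write n as a product of `parts` powers of two, each >= 2."""
    if parts == 1:
        return [(n,)] if n >= 2 else []
    out = []
    d = 2
    while d <= n // 2:
        if n % d == 0:
            out += [(d,) + rest for rest in factorizations(n // d, parts - 1)]
        d *= 2
    return out


def intra_stage_strategies(gpus):
    """Ordered (type, degree) sequences that use exactly `gpus` devices."""
    if gpus == 1:
        return [()]  # a single GPU per stage: nothing left to split
    result = []
    for k in range(1, len(TYPES) + 1):
        for order in permutations(TYPES, k):
            if "DP" in order and "SDP" in order:
                continue  # assumed rule: do not mix DP and SDP
            for degrees in factorizations(gpus, k):
                result.append(tuple(zip(order, degrees)))
    return result


def candidate_strategies(total_gpus=8):
    """All (pp_degree, intra_stage_strategy, use_ckpt) triples."""
    cands = []
    pp = 1
    while pp <= total_gpus:
        for intra in intra_stage_strategies(total_gpus // pp):
            for ckpt in (False, True):
                cands.append((pp, intra, ckpt))
        pp *= 2
    return cands


print(len(candidate_strategies(8)))
```

Under these assumed rules, PP degrees 8, 4, 2, and 1 contribute 1, 3, 7, and 11 intra-stage strategies respectively, which gives 22 without CKPT and 44 once the CKPT choice is included.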
