Improving Automatic Parallel Training via Balanced Memory Workload Optimization
Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, Bin Cui
TL;DR
This work tackles the challenge of efficiently training Transformer models across multiple GPUs by automatically exploring a five-dimensional parallelism space: data parallelism (DP), sharded data parallelism (SDP), tensor parallelism (TP), pipeline parallelism (PP), and activation checkpointing (CKPT). It introduces Galvatron-BMW, which uses a decision-tree decomposition to prune the search space and a dynamic programming search to identify the optimal hybrid plan, augmented with a bi-objective workload-balancing framework that improves hardware utilization. The system estimates compute, communication, and memory costs, accounting for compute-communication overlap and CKPT recomputation, and is implemented atop PyTorch with NCCL support. Empirical results across NLP and CV models show throughput gains of up to 530% over pure parallelism strategies and strong gains over prior automatic hybrid approaches, under diverse hardware budgets. The practical impact is a scalable, user-friendly automatic parallel training workflow that can handle large Transformer models with heterogeneous memory footprints and interconnects.
Abstract
Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual efforts to design distributed training plans or limit parallelism combinations to a constrained search space. In this paper, we present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy. To effectively navigate this vast search space, we employ a decision tree approach for decomposition and pruning based on intuitive insights. We further utilize a dynamic programming search algorithm to derive the optimal plan. Moreover, to improve resource utilization and enhance system efficiency, we propose a bi-objective optimization workflow that focuses on workload balance. Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints. Across all tested scenarios, Galvatron-BMW consistently achieves superior system throughput, surpassing previous approaches that rely on limited parallelism strategies.
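To make the dynamic programming search concrete, the following is a minimal sketch, not Galvatron-BMW's actual implementation. It assumes each layer has a small table of candidate strategies with an estimated time cost and an integer memory cost, and it picks one strategy per layer to minimize total time under a memory budget. The function name `search_plan`, the cost tables, and the strategy labels are hypothetical, and the sketch ignores the cost of switching strategies between adjacent layers.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-layer cost table: strategy name -> (time cost, memory cost in integer units).
LayerCosts = Dict[str, Tuple[float, int]]


def search_plan(layers: List[LayerCosts], budget: int) -> Optional[Tuple[float, List[str]]]:
    """Pick one strategy per layer to minimize total time within a memory budget.

    Simplifying assumption: no transition cost between adjacent layers.
    Returns (total time, per-layer strategy names), or None if nothing fits.
    """
    INF = float("inf")
    # best[m] = (minimal time, plan) for the layers seen so far using exactly m memory units
    best: List[Tuple[float, List[str]]] = [(INF, [])] * (budget + 1)
    best[0] = (0.0, [])
    for costs in layers:
        nxt: List[Tuple[float, List[str]]] = [(INF, [])] * (budget + 1)
        for used, (t, plan) in enumerate(best):
            if t == INF:
                continue  # this memory level is unreachable
            for name, (dt, dm) in costs.items():
                m = used + dm
                if m <= budget and t + dt < nxt[m][0]:
                    nxt[m] = (t + dt, plan + [name])
        best = nxt
    result = min(best, key=lambda entry: entry[0])
    return None if result[0] == INF else result


# Illustrative numbers only: two identical layers, three candidate strategies each.
layer = {"DP": (1.0, 4), "TP": (2.0, 2), "DP+CKPT": (1.5, 2)}
print(search_plan([layer, layer], budget=6))
```

With a budget of 6 units, plain DP on both layers does not fit (8 units), so the search trades time for memory on one layer by enabling checkpointing there. A tighter budget pushes more layers toward memory-saving strategies, and a budget too small for any combination yields `None`.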
