WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, Chris Cai, Yuchen Hao, Yufei Ding
TL;DR
The paper tackles workload imbalance in 4D parallelism for large language model training, which arises because attention cost depends on the input and long documents dominate some micro-batches. It introduces WLB-LLM, which combines workload-aware variable-length packing at the pipeline level with fine-grained per-document sharding at the context level, plus an adaptive runtime sharding selector and an outlier-delay mechanism. Empirical results across model scales and context windows show an average end-to-end speedup of $1.23\times$ (up to $1.30\times$ with longer contexts), while preserving data randomness and convergence. The methods are practical ways to improve the efficiency of long-context LLM training on large GPU clusters and to make scaling future models more cost-effective.
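The pipeline-level imbalance described above can be illustrated with a small sketch. It assumes a hypothetical cost model (a linear term plus a quadratic causal-attention term per document) and uses a generic greedy longest-processing-time heuristic. The names `doc_cost`, `pack_by_workload`, and `imbalance`, and the weights in the cost model, are illustrative assumptions and not the paper's actual algorithm.

```python
import heapq


def doc_cost(length, linear_weight=1.0, attn_weight=1.0 / 1024):
    """Hypothetical per-document cost: a linear term plus a quadratic
    attention term. The weights are assumptions for illustration only."""
    return linear_weight * length + attn_weight * length * length


def pack_by_workload(doc_lengths, num_micro_batches):
    """Greedy packing sketch: place each document (most expensive first)
    into the micro-batch with the smallest accumulated cost.
    Micro-batches may end up with different token counts."""
    heap = [(0.0, i) for i in range(num_micro_batches)]
    heapq.heapify(heap)
    bins = [[] for _ in range(num_micro_batches)]
    for length in sorted(doc_lengths, reverse=True):
        load, idx = heapq.heappop(heap)
        bins[idx].append(length)
        heapq.heappush(heap, (load + doc_cost(length), idx))
    return bins


def imbalance(bins):
    """Ratio of the heaviest micro-batch cost to the mean cost (1.0 is ideal)."""
    loads = [sum(doc_cost(n) for n in b) for b in bins]
    return max(loads) / (sum(loads) / len(loads))


docs = [4096, 2048, 2048, 1024, 1024, 1024, 1024]

# Equal token counts (6144 each) but unequal cost under the assumed model.
token_balanced = [[4096, 2048], [2048, 1024, 1024, 1024, 1024]]
print("token-balanced imbalance:", imbalance(token_balanced))

# Workload-aware packing: unequal token counts, closer cost.
workload_balanced = pack_by_workload(docs, 2)
print("workload-aware bins:", workload_balanced)
print("workload-aware imbalance:", imbalance(workload_balanced))
```

Under this assumed cost model, the two token-balanced micro-batches differ in cost because one of them holds the longest document, whereas letting micro-batch lengths vary allows the costs to be matched. A single very long document can still dominate any packing, which is presumably the case the outlier-delay mechanism is meant to handle.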
Abstract
In this work, we present WLB-LLM, a workload-balanced 4D parallelism approach for large language model training. We first thoroughly analyze the workload imbalance issue in LLM training and identify two primary sources of imbalance, at the pipeline parallelism and context parallelism levels. To address the imbalance at the pipeline parallelism level, WLB-LLM incorporates a workload-aware variable-length document packing method that balances the computation and communication workload across micro-batches. At the context parallelism level, WLB-LLM introduces a novel fine-grained per-document sharding strategy, ensuring that each worker within a context parallelism group has an identical workload. Comprehensive experiments under different model scales demonstrate that WLB-LLM significantly mitigates the workload imbalance during 4D-parallel LLM training and achieves an average speedup of $1.23\times$ when applied in our internal LLM training framework.
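As a rough illustration of per-document sharding at the context parallelism level, the sketch below splits every document into `2 * cp_size` chunks and gives each worker one chunk from the front and the mirrored chunk from the back. This head/tail pairing is a common way to balance causal attention and is assumed here; the abstract does not spell out the exact assignment. The function names are hypothetical, and document lengths are assumed to be divisible by `2 * cp_size` (padding would be needed otherwise).

```python
def causal_pairs(start, end):
    """Number of (query, key) pairs for queries at positions [start, end)
    of one document under a causal mask: the query at position p
    attends to p + 1 keys."""
    return (end * (end + 1) - start * (start + 1)) // 2


def per_document_shards(doc_lengths, cp_size):
    """Return, for each worker, a list of (doc_index, start, end) token ranges.

    Each document is cut into 2 * cp_size equal chunks; worker r takes
    chunk r and chunk (2 * cp_size - 1 - r) of every document.
    """
    n_chunks = 2 * cp_size
    shards = [[] for _ in range(cp_size)]
    for doc_idx, length in enumerate(doc_lengths):
        if length % n_chunks != 0:
            raise ValueError("sketch assumes lengths divisible by 2 * cp_size")
        chunk = length // n_chunks
        for rank in range(cp_size):
            head, tail = rank, n_chunks - 1 - rank
            shards[rank].append((doc_idx, head * chunk, (head + 1) * chunk))
            shards[rank].append((doc_idx, tail * chunk, (tail + 1) * chunk))
    return shards


def attention_workload(shards):
    """Causal-attention pair count handled by each worker."""
    return [sum(causal_pairs(s, e) for _, s, e in worker) for worker in shards]


shards = per_document_shards([16, 8, 32], cp_size=2)
print(attention_workload(shards))
```

With this pairing, every worker receives the same number of tokens and the same number of causal attention pairs from each document, so the totals match regardless of how document lengths are mixed within the packed sequence.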
