WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training

Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, Chris Cai, Yuchen Hao, Yufei Ding

TL;DR

The paper tackles workload imbalance in 4D parallelism for large language model training, which is caused by input-dependent attention computation and long documents. It introduces WLB-LLM, which combines workload-aware variable-length document packing at the pipeline parallelism level with fine-grained per-document sharding at the context parallelism level, plus an adaptive runtime sharding selector and an outlier-delay mechanism. Empirical results across model scales and context windows show an average end-to-end speedup of $1.23\times$ (up to $1.30\times$ with longer contexts) while preserving data randomness and convergence. These methods improve the efficiency of long-context LLM training on large GPU clusters and make scaling future models more cost-effective.

Abstract

In this work, we present WLB-LLM, a workload-balanced 4D parallelism for large language model training. We first thoroughly analyze the workload imbalance issue in LLM training and identify two primary sources of imbalance, at the pipeline parallelism and context parallelism levels. To address the imbalance at the pipeline parallelism level, WLB-LLM incorporates a workload-aware variable-length document packing method that balances the computation and communication workload across micro-batches. At the context parallelism level, WLB-LLM introduces a novel fine-grained per-document sharding strategy, ensuring that each worker within a context parallelism group has an identical workload. Comprehensive experiments under different model scales demonstrate that WLB-LLM significantly mitigates the workload imbalance during 4D parallelism LLM training and achieves an average speedup of 1.23x when applied in our internal LLM training framework.
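
To make the context-parallelism imbalance concrete, the following is a minimal sketch, not the authors' implementation. It assumes causal attention restricted to each document, counts attention work as the number of query-key pairs, and assumes every length is divisible by 2 × CP. It compares sharding the whole packed sequence across CP workers with sharding each document separately. The function names and the symmetric chunk-pairing scheme are illustrative assumptions.

```python
def attn_cost(positions):
    """Query-key pairs under causal attention: a token at in-document
    position p attends to p + 1 tokens (itself and its predecessors)."""
    return sum(p + 1 for p in positions)


def symmetric_shards(n, cp):
    """Split range(n) into 2*cp equal chunks and give worker k the
    chunks k and 2*cp-1-k (assumes n is divisible by 2*cp)."""
    size = n // (2 * cp)
    chunks = [list(range(i * size, (i + 1) * size)) for i in range(2 * cp)]
    return [chunks[k] + chunks[2 * cp - 1 - k] for k in range(cp)]


def per_sequence_costs(doc_lens, cp):
    """Shard the packed sequence as a whole, ignoring document boundaries."""
    # In-document position of every token in the packed sequence.
    in_doc_pos = [p for d in doc_lens for p in range(d)]
    shards = symmetric_shards(len(in_doc_pos), cp)
    return [attn_cost(in_doc_pos[i] for i in shard) for shard in shards]


def per_document_costs(doc_lens, cp):
    """Shard every document separately so each worker gets a slice of each."""
    costs = [0] * cp
    for d in doc_lens:
        for k, shard in enumerate(symmetric_shards(d, cp)):
            costs[k] += attn_cost(shard)
    return costs


if __name__ == "__main__":
    docs = [16, 8, 8]  # hypothetical document lengths packed into one sequence
    print("per-sequence:", per_sequence_costs(docs, cp=2))
    print("per-document:", per_document_costs(docs, cp=2))
```

In this toy setting, per-sequence sharding leaves one worker with most of the long document's expensive late-position tokens, while per-document sharding gives both workers the same count. A real system would also need to handle lengths that do not divide evenly and the efficiency of attention kernels on small shards, which this sketch ignores.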

Paper Structure

This paper contains 23 sections, 2 equations, 16 figures, 2 tables, 1 algorithm.

Figures (16)

  • Figure 1: Workload imbalance observed in large-scale LLM training jobs and the reason for the imbalance.
  • Figure 2: Overview of 4D parallelism for LLM training.
  • Figure 3: Input Data Statistics: Distribution of input document lengths (Left) and average document length at each token position (Right).
  • Figure 4: The workload imbalance comes from the PP-level document packing and CP-level sequence sharding.
  • Figure 5: The process of latency propagation in 4D parallelism LLM training across different parallelism hierarchies. The impact of workload imbalance is enlarged at the PP level.
  • ...and 11 more figures