Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment

Haoyang Li, Fangcheng Fu, Sheng Lin, Hao Ge, Xuanyu Wang, Jiawen Niu, Jinbao Xue, Yangyu Tao, Di Wang, Jie Jiang, Bin Cui

TL;DR

This work tackles data-induced inefficiencies in training large Transformer models by addressing both data sampling and data packing imbalances. It introduces Hydraulis, a co-design system that jointly optimizes dynamic heterogeneous parallel strategies and a two-stage sequence assignment, supported by optimization-propagation disaggregation and subgraphs to enable flexible strategy transitions. Key contributions include the LNK-based communication abstraction, two-stage sequence packing and dispatching, and a data distribution-aware strategy generator, achieving $1.32$-$2.66\times$ throughput gains over state-of-the-art baselines. The approach demonstrates strong scalability and load balancing on large GPU clusters, offering practical impact for efficient training of very large models.

Abstract

To optimize large Transformer model training, both efficient parallel computing and advanced data management are indispensable. However, current methods often assume a stable and uniform training workload, neglecting data-induced imbalances (arising from both sampling and packing processes), which can impede training performance. Specifically, data sampling imbalance arises from uneven sequence length distribution of the training data, while data packing imbalance stems from the discrepancy between the linear memory complexity and quadratic time complexity of the attention mechanism. To address these imbalance issues, we develop Hydraulis, which jointly optimizes the parallel strategies and data assignment. For one thing, we introduce large model training with dynamic heterogeneous parallel strategies in response to the sequence length variations within and across training iterations. For another, we devise a two-stage data assignment approach, which strikes a good balance in terms of the training workloads both within and across model replicas. Empirical results demonstrate that Hydraulis outperforms existing systems by 1.32-2.66 times.
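
The packing imbalance described above can be illustrated with a small sketch. It is not the paper's cost model: it assumes variable-length (varlen) attention in which each packed sequence attends only to itself, and it uses hypothetical helper names (`pack_tokens`, `pack_attention_cost`) with token count as a proxy for memory and the sum of squared lengths as a proxy for attention time, ignoring constants and non-attention layers.

```python
# Illustrative proxy only (an assumption, not Hydraulis's actual cost model):
# memory ~ total tokens in a pack (linear),
# attention time ~ sum of squared sequence lengths (quadratic per sequence).

def pack_tokens(pack):
    """Linear proxy for memory: total number of tokens in the pack."""
    return sum(pack)

def pack_attention_cost(pack):
    """Quadratic proxy for attention time under varlen attention."""
    return sum(n * n for n in pack)

# Two packs with the same token budget but different compositions.
pack_a = [4096]                     # one long sequence
pack_b = [1024, 1024, 1024, 1024]   # four short sequences

for name, pack in (("pack_a", pack_a), ("pack_b", pack_b)):
    print(name, "tokens:", pack_tokens(pack),
          "attention cost:", pack_attention_cost(pack))

# Same memory proxy, but the attention-time proxy differs by this factor.
ratio = pack_attention_cost(pack_a) / pack_attention_cost(pack_b)
print("cost ratio:", ratio)
```

Under these assumptions, both packs fill the same token budget, yet the pack holding one long sequence costs four times as much attention time. Balancing packs by token count alone can therefore still leave the compute load uneven.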

Paper Structure

This paper contains 43 sections, 22 equations, 20 figures, 4 tables, 1 algorithm.

Figures (20)

  • Figure 1: Sequence length distribution of two popular open-sourced datasets. We display the number of sequences and tokens within each length range, with both datasets exhibiting high variance.
  • Figure 2: Left: Illustration of padding and packing. Right: Comparison of the attention latency during LLaMA2 7B training on Nvidia A800 GPUs with tensor parallel (TP). Standard attention scales quadratically, while varlen achieves linear scaling relative to the sequence length (1K) before packing.
  • Figure 3: "O, P, G" represents optimizer states, parameters and gradients, respectively. "AG, RS, S, R" represents all-gather, reduce-scatter, send and receive, respectively. "To" represents the data type transfer operator (from 32- to 16-bit floating points), and "LNK" represents the link operator designed by us. (a) The overall workflow of the model states sharding technique combined with mixed-precision training. (b) Our proposed optimization-propagation disaggregation technique, which decouples the distributed strategies for the optimization phase and propagation phase, allowing for arbitrary heterogeneous parallel strategies during propagation.
  • Figure 4: The dot plot on the left shows the fluctuation of maximum sequence length, emphasizing inter-iteration imbalance, while the bar chart on the right presents the distribution of sequence lengths within an iteration (20th iteration), highlighting intra-iteration imbalance.
  • Figure 5: Memory and throughput trade-offs. "OOM" indicates out-of-memory, and "N/A" indicates not available. The experiment is conducted using a 13B LLaMA model with 16 Nvidia A800 GPUs. Throughput is measured in tokens per second and maximum sequence lengths are given by $\texttt{MaxLen}(\cdot)$ (see the sequence packing subsection and the memory model appendix).
  • ...and 15 more figures