Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment
Haoyang Li, Fangcheng Fu, Sheng Lin, Hao Ge, Xuanyu Wang, Jiawen Niu, Jinbao Xue, Yangyu Tao, Di Wang, Jie Jiang, Bin Cui
TL;DR
This work tackles data-induced inefficiencies in training large Transformer models by addressing both data sampling and data packing imbalances. It introduces Hydraulis, a co-design system that jointly optimizes dynamic heterogeneous parallel strategies and a two-stage sequence assignment, supported by optimization-propagation disaggregation and subgraphs to enable flexible strategy transitions. Key contributions include the LNK-based communication abstraction, two-stage sequence packing and dispatching, and a data distribution-aware strategy generator, achieving $1.32$-$2.66\times$ throughput gains over state-of-the-art baselines. The approach demonstrates strong scalability and load balancing on large GPU clusters, offering practical impact for efficient training of very large models.
Abstract
To optimize large Transformer model training, both efficient parallel computing and advanced data management are indispensable. However, current methods often assume a stable and uniform training workload, neglecting data-induced imbalances (arising from both the sampling and packing processes) that can impede training performance. Specifically, data sampling imbalance arises from the uneven sequence length distribution of the training data, while data packing imbalance stems from the discrepancy between the linear memory complexity and the quadratic time complexity of the attention mechanism. To address these imbalance issues, we develop Hydraulis, which jointly optimizes the parallel strategies and data assignment. First, we introduce large model training with dynamic heterogeneous parallel strategies in response to the sequence length variations within and across training iterations. Second, we devise a two-stage data assignment approach, which balances the training workloads both within and across model replicas. Empirical results demonstrate that Hydraulis outperforms existing systems by 1.32-2.66 times.
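The data packing imbalance described above can be illustrated with a small numeric sketch. This is not code from Hydraulis: the function name `pack_cost` and the cost proxies are assumptions made for illustration. It assumes that sequences packed together do not attend to each other, so attention time is approximated by the sum of squared sequence lengths, memory is approximated by the total token count, and all constant factors are ignored.

```python
def pack_cost(seq_lens):
    """Return (memory_proxy, attention_time_proxy) for one packed batch.

    Hypothetical cost model for illustration only:
    memory grows linearly with tokens, attention time grows
    quadratically with each sequence's own length.
    """
    memory = sum(seq_lens)                    # linear in total tokens
    attn_time = sum(n * n for n in seq_lens)  # quadratic per sequence
    return memory, attn_time


# Two packs with the same total number of tokens.
pack_a = [4096, 4096]   # two long sequences
pack_b = [1024] * 8     # eight short sequences

mem_a, time_a = pack_cost(pack_a)
mem_b, time_b = pack_cost(pack_b)

print("memory proxy:", mem_a, mem_b)
print("attention time proxy ratio (a / b):", time_a / time_b)
```

Under these assumptions both packs hold 8192 tokens, yet the pack of long sequences has four times the attention cost. Balancing packs by token count alone therefore leaves the compute workload uneven.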
