UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
Hao Lin, Ke Wu, Jie Li, Jun Li, Wu-Jun Li
TL;DR
UniAP addresses the sub-optimal solutions produced by existing automatic parallelism methods by jointly optimizing inter-layer and intra-layer strategies in a single mixed integer quadratic programming (MIQP) framework. It builds time and memory cost models from profiling data, encodes the problem as an MIQP that maximizes training throughput (equivalently, minimizes time per iteration, TPI), and applies a Unified Optimization Process that enumerates pipeline configurations and solves for the optimal layer placement and intra-layer strategies. Experiments on five Transformer models in multiple environments show throughput gains of up to $3.80\times$ and strategy-optimization time reduced by up to $107\times$ compared with state-of-the-art baselines. The evaluation covers homogeneous clusters; extending automatic parallelism to heterogeneous hardware is left to future work.
Abstract
Distributed learning is commonly used for training deep learning models, especially large models. In distributed learning, manual parallelism (MP) methods demand considerable human effort and have limited flexibility. Hence, automatic parallelism (AP) methods have recently been proposed for automating the parallel strategy optimization process. Existing AP methods suffer from sub-optimal solutions because they do not jointly optimize the two categories of parallel strategies (i.e., inter-layer parallelism and intra-layer parallelism). In this paper, we propose a novel AP method called UniAP, which unifies inter- and intra-layer automatic parallelism by mixed integer quadratic programming. To the best of our knowledge, UniAP is the first parallel method that can jointly optimize the two categories of parallel strategies to find an optimal solution. Experimental results show that UniAP outperforms state-of-the-art methods by up to 3.80$\times$ in throughput and reduces strategy optimization time by up to 107$\times$ across five Transformer-based models.
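To make the idea of joint optimization concrete, the following is a minimal sketch, not UniAP's actual formulation. It assumes a toy four-layer model, hypothetical profiled (time, memory) costs for two intra-layer strategies labeled `dp` and `tp`, a simplified GPipe-style estimate of time per iteration, and a per-stage memory limit. It replaces the MIQP solver with plain brute-force enumeration, which is only feasible at this tiny scale. All names and numbers are illustrative assumptions.

```python
from itertools import combinations, product

# Hypothetical profiling data (assumption): for each layer, the
# (time, memory) cost under each candidate intra-layer strategy.
LAYER_COSTS = [
    {"dp": (4.0, 6.0), "tp": (5.0, 3.0)},
    {"dp": (4.0, 6.0), "tp": (5.0, 3.0)},
    {"dp": (2.0, 4.0), "tp": (3.0, 2.0)},
    {"dp": (2.0, 4.0), "tp": (3.0, 2.0)},
]


def contiguous_partitions(num_layers, num_stages):
    """Yield every split of the layers into contiguous (start, end) stages."""
    for cuts in combinations(range(1, num_layers), num_stages - 1):
        bounds = (0, *cuts, num_layers)
        yield [(bounds[i], bounds[i + 1]) for i in range(num_stages)]


def time_per_iteration(stage_times, num_micro_batches):
    """Simplified pipeline estimate (assumption): one pass through all
    stages, plus the slowest stage repeated for the remaining micro-batches."""
    return sum(stage_times) + (num_micro_batches - 1) * max(stage_times)


def search(layer_costs, stage_options, num_micro_batches, mem_limit):
    """Jointly pick the stage count, layer placement, and per-layer strategy.

    Returns (tpi, stages, strategies) for the best feasible plan, or None
    if no plan fits within the per-stage memory limit.
    """
    num_layers = len(layer_costs)
    strategy_names = sorted(layer_costs[0])
    best = None
    for num_stages in stage_options:  # enumerate pipeline configurations
        for stages in contiguous_partitions(num_layers, num_stages):
            for choice in product(strategy_names, repeat=num_layers):
                stage_times = []
                for start, end in stages:
                    layers = range(start, end)
                    mem = sum(layer_costs[i][choice[i]][1] for i in layers)
                    if mem > mem_limit:
                        break  # this stage exceeds the memory limit
                    stage_times.append(
                        sum(layer_costs[i][choice[i]][0] for i in layers)
                    )
                else:
                    tpi = time_per_iteration(stage_times, num_micro_batches)
                    if best is None or tpi < best[0]:
                        best = (tpi, stages, choice)
    return best


if __name__ == "__main__":
    plan = search(LAYER_COSTS, stage_options=(1, 2),
                  num_micro_batches=4, mem_limit=10.0)
    print(plan)
```

In this toy instance, the best plan mixes strategies within a stage and depends on where the stage boundary falls, which is why optimizing the two categories separately can miss the optimum. A real system would hand the same decision variables to an MIQP solver rather than enumerate them.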
