UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming

Hao Lin, Ke Wu, Jie Li, Jun Li, Wu-Jun Li

TL;DR

UniAP addresses the sub-optimal strategies produced by existing automatic parallelism methods by jointly optimizing inter-layer and intra-layer parallelism in a single mixed integer quadratic programming (MIQP) formulation. It builds time and memory cost models from profiling data, formulates the strategy search as an MIQP that minimizes time per iteration (TPI), which maximizes training throughput, and runs a Unified Optimization Process that enumerates pipeline configurations and solves for the optimal layer placement and intra-layer strategies. Across five Transformer-based models and multiple environments, UniAP improves throughput by up to $3.80\times$ and reduces strategy optimization time by up to $107\times$ relative to state-of-the-art baselines. The evaluation covers homogeneous clusters only; extending automatic parallelism to heterogeneous hardware is left to future work.
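The enumerate-then-solve structure described above can be illustrated with a toy search. This is a minimal sketch under loud assumptions, not the paper's implementation: the per-layer costs, `MEMORY_LIMIT`, and the strategy names `dp`/`tp` are invented, the inner MIQP solve is replaced by plain brute force, communication cost and device counts are ignored, and TPI uses the common GPipe-style estimate (sum of stage times plus the bubble term `(c - 1) * max stage time`).

```python
from itertools import combinations, product

# Hypothetical profile: strategy -> (time per sample, memory per stage slot).
# Every number here is made up for illustration.
LAYER_COSTS = [
    {"dp": (1.0, 6.0), "tp": (1.5, 3.0)},
    {"dp": (2.0, 8.0), "tp": (2.5, 4.0)},
    {"dp": (1.0, 6.0), "tp": (1.5, 3.0)},
]
MEMORY_LIMIT = 7.0  # assumed per-stage memory budget
BATCH_SIZE = 8


def search():
    """Brute-force stand-in for the per-configuration MIQP solve."""
    n = len(LAYER_COSTS)
    best = None
    # Outer loops: enumerate pipeline configurations.
    for num_stages in range(1, n + 1):
        for num_mb in (1, 2, 4, 8):
            mb_size = BATCH_SIZE / num_mb
            # Inner loops: contiguous layer placement and intra-layer strategies.
            for cuts in combinations(range(1, n), num_stages - 1):
                bounds = (0, *cuts, n)
                for strategies in product(("dp", "tp"), repeat=n):
                    stage_times = []
                    for s in range(num_stages):
                        layers = range(bounds[s], bounds[s + 1])
                        mem = sum(LAYER_COSTS[l][strategies[l]][1] for l in layers)
                        if mem > MEMORY_LIMIT:
                            break  # this stage violates the memory constraint
                        per_sample = sum(LAYER_COSTS[l][strategies[l]][0] for l in layers)
                        stage_times.append(mb_size * per_sample)
                    else:
                        # GPipe-style estimate: fill/drain plus steady-state bubble.
                        tpi = sum(stage_times) + (num_mb - 1) * max(stage_times)
                        candidate = (tpi, num_stages, num_mb, bounds, strategies)
                        if best is None or candidate < best:
                            best = candidate
    return best


if __name__ == "__main__":
    tpi, num_stages, num_mb, bounds, strategies = search()
    print(tpi, num_stages, num_mb, bounds, strategies)
```

In this toy instance the memory budget forces the middle layer onto the cheaper-memory strategy, so the placement and the intra-layer choice cannot be decided independently, which is the motivation for optimizing them jointly.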

Abstract

Distributed learning is commonly used for training deep learning models, especially large models. In distributed learning, manual parallelism (MP) methods demand considerable human effort and have limited flexibility. Hence, automatic parallelism (AP) methods have recently been proposed for automating the parallel strategy optimization process. Existing AP methods suffer from sub-optimal solutions because they do not jointly optimize the two categories of parallel strategies (i.e., inter-layer parallelism and intra-layer parallelism). In this paper, we propose a novel AP method called UniAP, which unifies inter- and intra-layer automatic parallelism by mixed integer quadratic programming. To the best of our knowledge, UniAP is the first parallel method that can jointly optimize the two categories of parallel strategies to find an optimal solution. Experimental results show that UniAP outperforms state-of-the-art methods by up to 3.80$\times$ in throughput and reduces strategy optimization time by up to 107$\times$ across five Transformer-based models.

Paper Structure (27 sections, 1 theorem, 13 equations, 10 figures, 6 tables, 1 algorithm)

Key Result

Theorem B.1

A subgraph with node set $\mathbb{V}_i=\{u\in \mathbb{V}: \mathbf{P}_{ui}=1\}$ is contiguous if and only if there exists $\mathbf{Z}_{vi}$ such that Equations (eqn:method:order-preserving:1), (eqn:method:order-preserving:2), and (eqn:method:order-preserving:3) are satisfied.
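To make the contiguity property concrete, the following sketch checks it directly on a small DAG. It assumes the usual definition from the device-placement literature, which this page does not quote: a node set is contiguous if no path leaves the set and later re-enters it. The graph encoding and the function names are hypothetical, and the code does not implement the $\mathbf{Z}_{vi}$ constraints from the theorem; it only tests the property those constraints are stated to characterize.

```python
from collections import deque


def reachable_from(graph, start):
    """Return all nodes reachable from `start` via at least one edge."""
    seen = set()
    queue = deque(graph.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, ()))
    return seen


def is_contiguous(graph, subset):
    """Assumed definition: no u, w in the subset and v outside it
    such that v is reachable from u and w is reachable from v."""
    subset = set(subset)
    for u in subset:
        for v in reachable_from(graph, u) - subset:
            if reachable_from(graph, v) & subset:
                return False  # a path leaves the subset and re-enters it
    return True


# Diamond-shaped graph given as adjacency lists: a -> {b, c} -> d.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

if __name__ == "__main__":
    print(is_contiguous(GRAPH, {"a", "b", "c"}))
    print(is_contiguous(GRAPH, {"a", "d"}))
```

On the diamond graph, `{"a", "b", "c"}` passes the check, while `{"a", "d"}` fails because the path through `b` exits the set and returns to it.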

Figures (10)

  • Figure 1: Parallel methods for optimizing parallel strategies for a three-layer model. The different arrangements of slices with varying transparency within the same layer block indicate different intra-layer parallelism strategies adopted by layers. The different arrangements of gray blocks which wrap the layer blocks indicate different inter-layer parallelism strategies.
  • Figure 2: Flowchart of UniAP.
  • Figure 3: Time cost decomposition of a GPipe-style PP.
  • Figure 4: A contiguous set.
  • Figure 5: Scalability on training throughput and strategy optimization time.
  • ...and 5 more figures

Theorems & Definitions (3)

  • Definition 3.1
  • Theorem B.1
  • proof