Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling

Yujie Wang, Shenhan Zhu, Fangcheng Fu, Xupeng Miao, Jie Zhang, Juan Zhu, Fan Hong, Yong Li, Bin Cui

TL;DR

Spindle tackles the resource inefficiency of training large multi-task (MT) multi-modal (MM) models by introducing wavefront scheduling, built on MetaOps and graph contraction, to handle workload heterogeneity and execution dependencies. It combines a scalability estimator, a malleable scheduling allocator, and a greedy wavefront scheduler with device placement and a runtime engine that carries out the resulting plan. Experiments show speedups of up to 71% over Megatron-LM and DeepSpeed, with high device utilization and balanced memory use, across diverse MT MM workloads and scales. The approach offers a practical, near-optimal framework for accelerating MT MM training on real clusters.

Abstract

Recent foundation models can handle multiple tasks and multiple data modalities with a unified base model structure and several specialized model components. However, efficient training of such multi-task (MT) multi-modal (MM) models poses significant system challenges due to the sophisticated model architecture and the heterogeneous workloads of different tasks and modalities. In this paper, we propose Spindle, a new training system tailored for resource-efficient and high-performance training of MT MM models via wavefront scheduling. The key idea of Spindle is to decompose the model execution into waves and address the joint optimization problem sequentially, including both heterogeneity-aware workload parallelization and dependency-driven execution scheduling. We build our system and evaluate it on various MT MM models. Experiments demonstrate the superior performance and efficiency of Spindle, with speedups of up to 71% compared to state-of-the-art training systems.
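
As a rough illustration of what "decomposing the model execution into waves" could mean at the dependency level, the sketch below groups the nodes of a small dependency graph so that every node in a wave depends only on nodes in earlier waves. This is not Spindle's scheduler: the abstract says Spindle also decides workload parallelization for each wave, which is omitted here. The function name `waves_by_level` and the component names in the example graph are hypothetical.

```python
from collections import defaultdict


def waves_by_level(deps):
    """Group DAG nodes into waves.

    deps maps each node to the list of nodes it depends on.
    A node is placed in the first wave in which all of its
    prerequisites have already been scheduled.
    """
    nodes = set(deps)
    for prereqs in deps.values():
        nodes.update(prereqs)

    indegree = {n: 0 for n in nodes}
    children = defaultdict(list)
    for node, prereqs in deps.items():
        for p in prereqs:
            indegree[node] += 1
            children[p].append(node)

    # Nodes with no prerequisites form the first wave.
    frontier = sorted(n for n in nodes if indegree[n] == 0)
    waves = []
    scheduled = 0
    while frontier:
        waves.append(frontier)
        scheduled += len(frontier)
        ready = []
        for n in frontier:
            for c in children[n]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    ready.append(c)
        frontier = sorted(ready)

    if scheduled != len(nodes):
        raise ValueError("dependency graph contains a cycle")
    return waves


# Hypothetical two-task, three-modality model (names are illustrative only).
example_deps = {
    "text_enc": [],
    "vision_enc": [],
    "audio_enc": [],
    "fusion": ["text_enc", "vision_enc"],
    "loss_a": ["fusion"],
    "loss_b": ["audio_enc"],
}
example_waves = waves_by_level(example_deps)
```

In this toy graph the three encoders land in the first wave, `fusion` and `loss_b` in the second, and `loss_a` in the third. A real scheduler would additionally have to choose how many devices each node in a wave receives, which is the part of the joint problem this sketch leaves out.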

Paper Structure (28 sections, 1 theorem, 8 equations, 17 figures, 4 tables, 2 algorithms)

Key Result

Theorem 1

If the execution time functions $T_m(n)$, $n\in \mathbb{R}^+$, are positive and non-increasing for every MetaOp $m\in\mathcal{\widetilde{V}}_M$, then $P_{MPSP} = \{m \rightarrow P_m\}$ satisfies that $P_m = \{\langle n_m^\ast, 0, L_m\rangle\}, \forall m \in \mathcal{\widetilde{V}}_M$, where the opti

Figures (17)

  • Figure 1: The upper portion illustrates the general model structure and training flow of MT MM training. The lower portion displays the device utilization, measured in FLOPs per second, during the decoupled execution of four tasks across 2 iterations. Utilization fluctuations across different-colored lines and within same-colored lines indicate inter-task and intra-task workload heterogeneity, respectively.
  • Figure 2: Architecture overview of Spindle.
  • Figure 3: Computation graph $\mathcal{G}$ and contracted MetaGraph $\mathcal{G}_M$.
  • Figure 4: An example of the execution time and resource scalability of MetaOps in 4-task Multitask-CLIP, denoted as scaling curves.
  • Figure 5: Illustration of Spindle allocator and Spindle execution plan.
  • ...and 12 more figures

Theorems & Definitions (1)

  • Theorem 1