A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters

Chunyu Xue; Weihao Cui; Han Zhao; Quan Chen; Shulai Zhang; Pengyu Yang; Jing Yang; Shaobo Li; Minyi Guo

TL;DR

This paper tackles the challenge of efficiently training very large models on heterogeneous GPU clusters by jointly optimizing resource scheduling and adaptive parallelism. It introduces Crius, a training system built around a new core abstraction called Cell, which fixes a job's resource allocation and pipeline stages while leaving data and tensor parallelism open for runtime exploration. Crius uses an agile, decoupled estimator to forecast Cell performance quickly and a Cell-guided tuner to prune the parallelism search, yielding near-optimal plans with low overhead. Across physical and simulated large-scale clusters, Crius reduces job completion time by up to 48.9% and improves cluster throughput by up to 1.49x, and it also shortens queuing delay. The evaluation further covers scalability and generality, including a deadline-aware extension. The approach targets practical deployment of large-model training in diverse, real-world GPU environments.

Abstract

Joint consideration of scheduling and adaptive parallelism offers great opportunities for improving the training efficiency of large models on heterogeneous GPU clusters. However, integrating adaptive parallelism into a cluster scheduler expands the cluster scheduling space. The new space is the product of the original scheduling space and the parallelism exploration space of adaptive parallelism (also a product of pipeline, data, and tensor parallelism). The exponentially enlarged scheduling space and ever-changing optimal parallelism plan from adaptive parallelism together result in the contradiction between low-overhead and accurate performance data acquisition for efficient cluster scheduling. This paper presents Crius, a training system for efficiently scheduling multiple large models with adaptive parallelism in a heterogeneous cluster. Crius proposes a novel scheduling granularity called Cell. It represents a job with deterministic resources and pipeline stages. The exploration space of Cell is shrunk to the product of only data and tensor parallelism, thus exposing the potential for accurate and low-overhead performance estimation. Crius then accurately estimates Cells and efficiently schedules training jobs. When a Cell is selected as a scheduling choice, its represented job runs with the optimal parallelism plan explored. Experimental results show that Crius reduces job completion time by up to 48.9% and schedules large models with up to 1.49x cluster throughput improvement.
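To make the space reduction described in the abstract concrete, the following is a minimal sketch, not Crius's actual estimator or tuner. It counts parallelism plans for a single job under a simplifying assumption that is not taken from the paper: every pipeline stage uses the same data and tensor parallelism degree, and the product of the three degrees equals the GPU count. The function names `full_plans` and `cell_plans` are hypothetical.

```python
def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [k for k in range(1, n + 1) if n % k == 0]


def full_plans(num_gpus):
    """Enumerate (pipeline, data, tensor) degrees with p * d * t == num_gpus.

    Assumption (illustrative only): degrees are uniform across stages.
    """
    plans = []
    for p in divisors(num_gpus):
        for d in divisors(num_gpus // p):
            t = num_gpus // (p * d)
            plans.append((p, d, t))
    return plans


def cell_plans(num_gpus, num_stages):
    """Enumerate the (data, tensor) degrees left once resources and the
    number of pipeline stages are fixed, as a Cell does."""
    if num_gpus % num_stages != 0:
        raise ValueError("num_stages must divide num_gpus in this toy model")
    per_stage = num_gpus // num_stages
    return [(d, per_stage // d) for d in divisors(per_stage)]


if __name__ == "__main__":
    print(len(full_plans(8)), "plans over pipeline x data x tensor")
    print(len(cell_plans(8, 2)), "plans over data x tensor with 2 stages fixed")
```

In this toy model, a job on 8 GPUs has 10 candidate plans when all three dimensions are open, and 3 once the stage count is fixed at 2. The scheduler-level saving is larger, because the full space is multiplied across every candidate resource allocation.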

Paper Structure (50 sections, 21 figures, 2 tables, 1 algorithm)

This paper contains 50 sections, 21 figures, 2 tables, 1 algorithm.

Introduction
Background and Motivation
Training with Adaptive Parallelism
Scheduling Opportunities
Contradiction in Efficient Scheduling
Crius Design
Cell as the Core Abstraction
Cluster-friendly Scheduling Workflow
Sharding Scheduling Space into Cells
Cell.
Stage determination of a Cell.
Complexity analysis after sharding.
Mechanisms for Leveraging Cell
Agile Cell Estimation
Decoupling communication and computation.
...and 35 more sections

Figures (21)

Figure 1: Scheduling decisions contribute to different cluster-level throughput with the same resources: Case-A schedules two jobs onto 4×A100 connected with NVLink, Case-B schedules two jobs onto 4×A100 connected with PCIe and 4×V100 connected with NVLink. Each job runs with the optimal parallelism plan explored by adaptive parallelism.
Figure 2: The workflow of training large models with adaptive parallelism in a heterogeneous GPU cluster. D for data parallelism, T for tensor parallelism, P for pipeline parallelism. Cell is the new scheduling candidate proposed by Crius.
Figure 3: Throughput of different scheduling choices with adaptive parallelism. (a) Homogeneous resources are scaled between models. (b) Heterogeneous resources are exchanged between models. (Explanation for allocation plan: (4,2,2,0) means 4 GPUs for WRes-2B, 2 GPUs for MoE-2.4B, 2 GPUs for BERT-1.3B and 0 GPUs for MoE-1.3B. OOM indicates that WRes-2B cannot be accommodated with 2×A100 GPUs. D for data parallelism, T for tensor parallelism, P for pipeline parallelism.)
Figure 4: Parallelism plan and job performance variation when changing (a) GPU number: MoE-1.3B scales up linearly, while others approach the performance ceiling; (b) GPU type and (c) GPU topology: the BERT and MoE models show greater variance in throughput because their parallelism plans change. (The optimal parallelism plans are marked on top of each bar.)
Figure 5: Architecture overview of Crius.
...and 16 more figures
