Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling

Xinyi Zhang; Hanyu Zhao; Wencong Xiao; Xianyan Jia; Fei Xu; Yong Li; Wei Lin; Fangming Liu

TL;DR

Rubick schedules DL training in shared GPU clusters by treating execution-plan reconfigurability as a new scheduling dimension. It builds a white-box performance model that predicts throughput for combinations of models, execution plans (e.g., Megatron-style 3D parallelism, ZeRO variants, gradient accumulation (GA), and gradient checkpointing (GC)), and multi-resource allocations. Resource sensitivity curves derived from this model guide a heuristic scheduler that co-optimizes plans and resources while upholding the SLAs of guaranteed jobs. The system is implemented on Kubernetes with DeepSpeed and Megatron and evaluated on a 64-GPU cluster, where it improves average job completion time (JCT) by up to 3.2x and makespan by up to 1.4x over state-of-the-art baselines, with robust SLA enforcement. The work demonstrates that continuously reconfiguring execution plans, informed by a predictive model, can significantly improve cluster throughput and efficiency in dynamic, multi-tenant DL environments, particularly as model sizes scale.

Abstract

The era of large deep learning models has given rise to advanced training strategies such as 3D parallelism and the ZeRO series. These strategies enable various (re-)configurable execution plans for a training job, which exhibit remarkably different requirements of multiple resource types. Existing cluster scheduling systems, however, treat such reconfigurable training jobs as black boxes: they rely on users to choose execution plans statically, and then make resource allocations without awareness of the chosen plans and their resource requirements. This approach results in mismatches between execution plans and resources, making both training performance and cluster utilization far from optimal. We introduce Rubick, a cluster scheduling system for deep learning training that exploits the reconfigurability to improve job performance and cluster efficiency. Rubick incorporates the job execution planning as a new dimension in cluster scheduling, by continuously reconfiguring jobs' execution plans and tuning multi-resource allocations across jobs jointly. Such a co-optimization is navigated by a performance model that understands the diverse resource requirements and performance characteristics of different jobs and execution plans. Rubick exploits such a model to make performance-aware scheduling decisions to maximize cluster throughput while providing performance guarantees to individual jobs. Evaluations on a 64-GPU high-performance training cluster show that Rubick improves average job completion time and makespan by up to 3.2x and 1.4x, respectively, compared against state-of-the-art systems.

Paper Structure (46 sections, 1 equation, 11 figures, 4 tables, 1 algorithm)

Introduction
Background and Motivation
Large Model Training in GPU Clusters
Opportunity and Challenge
Opportunity: diverse multi-resource demands of different execution plans.
Challenge: complex performance characteristics of model-plan-resource combinations.
Summary.
System Overview
Modeling Reconfigurable DL Training
Modeling Computation and Communication
Modeling $T_{fwd}$.
Modeling $T_{bwd}$.
Modeling $T_{comm}$.
Combining computation and communication.
Modeling Optimizer and Offloading
...and 31 more sections

Figures (11)

Figure 1: Overview of Rubick. Its fundamental capability lies in leveraging white-box execution plans to enable job reconfiguration and cluster-level throughput optimization. Job execution plans (e.g., TP, PP, GC) are elaborated in the Background and Motivation section.
Figure 2: Consumption of each resource type for GPT-2 using various training execution plans, normalized to the highest value ($8$ GPUs, $10$ CPUs, $3.2$ GB memory, and $30$ GB/s bandwidth).
Figure 3: Throughput variation using various execution plans with changing resource limits. The first hour uses $4$ servers with $8$ A800 GPUs each, the second hour uses $4$ servers with $4$ A800 GPUs, and the remainder uses a single $4$-A800 server. TP+DP/PP means using TP inside nodes and DP/PP across nodes. Megatron 3D adopts a feasible TP+PP configuration such that each partition fits in a GPU, then scales out using DP.
Figure 4: Rubick architecture and scheduling workflow.
Figure 5: Simplified illustration of the performance model. Note that overlapping parts in the figure indicate only that their time spans overlap; whether the real execution is overlapped depends on the specific strategy.
...and 6 more figures
