Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Yizhou Luo; Qiang Wang; Shaohuai Shi; Jiaxin Lai; Shuhan Qi; Jiajia Zhang; Xuan Wang

TL;DR

This work tackles DL job scheduling in multi-tenant GPU clusters by letting concurrent jobs share GPUs while preserving model convergence via gradient accumulation. It introduces SJF-BSBF, a non-preemptive heuristic that uses pairwise optimality results (Theorem 1) to guide sharing decisions and batch-size scaling, then greedily extends them to many jobs. On the theory side, it derives conditions for optimal pairwise scheduling; on the practical side, it provides an online algorithm with provable efficiency. Empirical evaluation on physical clusters and large-scale simulations shows substantial reductions in average job completion time (27-33% vs Tiresias, up to 17% over aggressive sharing baselines) and high fidelity between simulated and real results, highlighting the importance of carefully chosen sharing settings.

Abstract

Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters are vital to reduce operational costs and enhance resource utilization. While recent schedulers have shown impressive performance in optimizing DL job performance and cluster utilization through periodic reallocation or selection of GPU resources, they also encounter challenges such as preemption and migration overhead, along with potential DL accuracy degradation. Nonetheless, few explore the potential benefits of GPU sharing to improve resource utilization and reduce job queuing times. Motivated by these insights, we present a job scheduling model allowing multiple jobs to share the same set of GPUs without altering job training settings. We introduce SJF-BSBF (shortest job first with best sharing benefit first), a straightforward yet effective heuristic scheduling algorithm. SJF-BSBF intelligently selects job pairs for GPU resource sharing and runtime settings (sub-batch size and scheduling time point) to optimize overall performance while ensuring DL convergence accuracy through gradient accumulation. In experiments with both physical DL workloads and trace-driven simulations, even as a preemption-free policy, SJF-BSBF reduces the average job completion time by 27-33% relative to the state-of-the-art preemptive DL schedulers. Moreover, SJF-BSBF can wisely determine the optimal resource sharing settings, such as the sharing time point and sub-batch size for gradient accumulation, outperforming the aggressive GPU sharing approach (baseline SJF-FFS policy) by up to 17% in large-scale traces.
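The abstract relies on gradient accumulation to keep convergence unchanged when a job runs with a smaller sub-batch size. The following is a minimal sketch of that general idea, not the paper's implementation: it assumes a toy 1-D linear model with a mean-squared-error loss, and the names `mse_grad` and `accumulated_grad` are hypothetical helpers introduced only for illustration. It shows that accumulating size-weighted sub-batch gradients reproduces the full-batch gradient, so the effective batch size seen by the optimizer stays the same.

```python
from typing import Sequence, Tuple

# Assumption for illustration: a sample is an (x, y) pair for the model y ≈ w * x.
Sample = Tuple[float, float]


def mse_grad(w: float, batch: Sequence[Sample]) -> float:
    """Gradient of the mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, batch: Sequence[Sample], sub_batch_size: int) -> float:
    """Process the batch in sub-batches and accumulate a size-weighted gradient."""
    total = 0.0
    for start in range(0, len(batch), sub_batch_size):
        sub = batch[start:start + sub_batch_size]
        # Weight each sub-batch gradient by its share of the full batch so that
        # the accumulated value equals the full-batch mean gradient.
        total += mse_grad(w, sub) * len(sub) / len(batch)
    return total


batch = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0)]
full = mse_grad(0.5, batch)
accum = accumulated_grad(0.5, batch, sub_batch_size=2)
print(abs(full - accum) < 1e-9)
```

In a real training loop the optimizer step would be applied once per full batch, after all sub-batches have been accumulated; a smaller sub-batch lowers per-step GPU memory and compute demand at the cost of more passes per update.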

Paper Structure (30 sections, 16 equations, 6 figures, 4 tables, 2 algorithms)

Introduction
Related Work
Preliminaries
S-SGD Based Distributed Deep Learning
All-Reduce Communication
System Modeling and Problem Formulation
DL Job Training Time Modeling
Modeling GPU Computation
Modeling Network Communication
Sharing Performance Modeling
Modeling Gradient Accumulation
Scheduling Modeling
Problem Formulation
Solution
Scheduling One Job Pair
...and 15 more sections

Figures (6)

Figure 1: Three job schedules for two DL jobs.
Figure 2: System throughput for all DL models in our experiments, as measured using a 4-server cluster, each server with 4 NVIDIA 2080Ti GPUs. Each sub-figure shows the values for different resource and training batch size settings for each model.
Figure 3: TOP: System throughput of different DL models paired with CIFAR10 to share the same set of GPUs. BOTTOM: the interference ratio $\xi$ for different DL models and resource and training settings.
Figure 4: Performance of different policies in physical experiments.
Figure 5: Performance of different policies in simulation experiments.
...and 1 more figure

Theorems & Definitions (1)

proof
