Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters

Ziyue Luo, Jia Liu, Myungjin Lee, Ness B. Shroff

TL;DR

This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm, a prediction-assisted online approach to scheduling distributed DL training jobs in GPU clusters.

Abstract

The recent explosive growth of deep learning (DL) models has created a pressing need for efficient job scheduling for distributed deep learning training with mixed parallelisms (DDLwMP) in GPU clusters. This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm, a novel prediction-assisted online scheduling approach designed to mitigate the challenges associated with DL cluster scheduling. By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models and their associated distributed training configurations, A-SRPT strategically assigns jobs to the available GPUs, thereby minimizing inter-server communication overhead. Observing that most DDLwMP jobs recur, A-SRPT incorporates a random forest regression model to predict the number of training iterations. Crucially, A-SRPT maps the complex scheduling problem into a single-machine instance, which is solved optimally by a preemptive "shortest-remaining-processing-time-first" strategy. This optimal solution then guides actual job scheduling within the GPU cluster, yielding provably competitive scheduling efficiency. We conduct extensive real-world testbed and simulation experiments to evaluate the proposed algorithms.
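
The single-machine building block mentioned in the abstract, preemptive shortest-remaining-processing-time-first, can be sketched as follows. This is a minimal illustration of textbook SRPT under stated assumptions, not the paper's A-SRPT algorithm: the function name `srpt_schedule` and the `(release_time, processing_time)` job format are hypothetical, processing times are assumed to be known exactly (the paper instead predicts them), and GPU placement is ignored entirely.

```python
import heapq


def srpt_schedule(jobs):
    """Preemptive SRPT on one machine.

    jobs: list of (release_time, processing_time) tuples (hypothetical format).
    Returns a list with the completion time of each job, in input order.
    """
    n = len(jobs)
    order = sorted(range(n), key=lambda j: jobs[j][0])  # job indices by release time
    completion = [0.0] * n
    heap = []  # entries are (remaining_time, job_index)
    t = 0.0    # current time
    k = 0      # next job in `order` that has not been released yet

    while k < n or heap:
        if not heap:
            # Machine is idle: jump ahead to the next release.
            t = max(t, jobs[order[k]][0])
        # Admit every job released by time t.
        while k < n and jobs[order[k]][0] <= t:
            j = order[k]
            heapq.heappush(heap, (jobs[j][1], j))
            k += 1
        # Run the job with the shortest remaining time until it finishes
        # or until the next release, whichever comes first.
        remaining, j = heapq.heappop(heap)
        next_release = jobs[order[k]][0] if k < n else float("inf")
        run = min(remaining, next_release - t)
        t += run
        if run < remaining:
            heapq.heappush(heap, (remaining - run, j))  # preempted, re-queue
        else:
            completion[j] = t
    return completion


if __name__ == "__main__":
    example = [(0, 5), (1, 2), (2, 1)]
    times = srpt_schedule(example)
    print(times, sum(times))
```

In the example, the long job released at time 0 is preempted when shorter jobs arrive, so it finishes last; the sum of the returned values is the total job completion time that SRPT minimizes on a single machine.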

Paper Structure (18 sections, 4 theorems, 21 equations, 9 figures, 2 tables, 1 algorithm)

Key Result

Lemma 1

$\textit{OPT}_{A_1} \leq \rho\,\textit{OPT}_A$, where $\rho = \max_{i\in[I]}\frac{\alpha_i^{\max}}{\alpha_i^{\min}}$.
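
To make the factor $\rho$ concrete, the short sketch below evaluates it for a handful of made-up jobs. This excerpt does not define $\alpha_i^{\min}$ and $\alpha_i^{\max}$, so the code only assumes they are positive lower and upper bounds on some per-job quantity; the function name `compute_rho` and the sample values are hypothetical.

```python
def compute_rho(alpha_bounds):
    """Return rho = max_i (alpha_i_max / alpha_i_min).

    alpha_bounds: list of (alpha_min, alpha_max) pairs, one per job,
    with 0 < alpha_min <= alpha_max (assumed; not defined in this excerpt).
    """
    if not alpha_bounds:
        raise ValueError("need at least one job")
    for lo, hi in alpha_bounds:
        if lo <= 0 or hi < lo:
            raise ValueError("expected 0 < alpha_min <= alpha_max")
    return max(hi / lo for lo, hi in alpha_bounds)


if __name__ == "__main__":
    # Illustrative numbers only: the per-job ratios are 1.5, 4.0 and 1.0.
    bounds = [(2.0, 3.0), (1.0, 4.0), (5.0, 5.0)]
    print(compute_rho(bounds))
```

With these sample values the second job has the widest gap between its bounds, so it alone determines $\rho$; when every job has $\alpha_i^{\max} = \alpha_i^{\min}$, $\rho = 1$ and the bound in Lemma 1 reduces to $\textit{OPT}_{A_1} \leq \textit{OPT}_A$.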

Figures (9)

  • Figure 1: Three typical parallelisms for distributed DNN training.
  • Figure 2: GPU mapping: An illustrative example.
  • Figure 3: Algorithmic idea overview.
  • Figure 4: Percentage of jobs: different prediction errors.
  • Figure 5: Testbed experiment performance.
  • ...and 4 more figures

Theorems & Definitions (8)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 1: Total job completion time achieved by A-SRPT
  • proof
  • proof
  • proof
  • proof