GPU Cluster Scheduling for Network-Sensitive Deep Learning

Aakash Sharma; Vivek M. Bhasi; Sonali Singh; George Kesidis; Mahmut T. Kandemir; Chita R. Das

TL;DR

This work addresses the high cost of distributed DL training by mitigating network-induced delays through a network-aware scheduler, Dally, which combines delay scheduling, a network-sensitive preemption policy, and an auto-tuner for dynamic delay timers. It leverages modern high-speed networks (e.g., NVSwitch, GPU RDMA) and introduces ArtISt-sim, a high-fidelity multi-job DL cluster simulator built on ASTRA-sim that models the network slowdown caused by concrete placements. Trace-driven simulations show Dally achieving up to 69% makespan reduction, up to 83% lower average job completion time, up to 98% lower communication overhead under congested network conditions, and significant tail-queueing improvements over state-of-the-art baselines. The approach enables cost-effective, scalable DL deployment in multi-tenant cloud environments and provides a practical platform for researching DL scheduling with contemporary networking hardware.

Abstract

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources according to the DDL jobs' sensitivities to anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Using the simulation platform, we compare against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Our scheduler improves the end-to-end makespan for training all jobs by up to 69% compared to the prevailing consolidation-based scheduling methods, while reducing the average job completion time by up to 83% and the communication overheads by up to 98% under congested networking conditions.
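
The delay scheduling idea in component (i) can be illustrated with a minimal sketch. It is not the paper's algorithm: the tier names, the `DelayTimers` type, the function names, and the rule that timers are compared against total queueing time are all assumptions made for illustration. A job initially accepts only a machine-level (fully consolidated) placement and relaxes to rack-level and then network-level placements as its wait exceeds each timer.

```python
from dataclasses import dataclass

# Locality tiers, from most to least consolidated (hypothetical labels).
TIERS = ("machine", "rack", "network")


@dataclass
class DelayTimers:
    """Hypothetical per-job delay timers, in seconds of total queueing time."""
    machine: float  # wait at least this long before accepting a rack-level offer
    rack: float     # wait at least this long before accepting a network-level offer


def acceptable_tiers(waited: float, timers: DelayTimers) -> tuple:
    """Return the placement tiers a job will accept after `waited` seconds."""
    if waited < timers.machine:
        return TIERS[:1]   # only a same-machine placement
    if waited < timers.rack:
        return TIERS[:2]   # same machine or same rack
    return TIERS           # any placement, including across the network


def decide(waited: float, offered_tier: str, timers: DelayTimers) -> bool:
    """Accept a resource offer only if its tier is currently acceptable."""
    return offered_tier in acceptable_tiers(waited, timers)


if __name__ == "__main__":
    timers = DelayTimers(machine=60.0, rack=300.0)  # illustrative values only
    for waited, tier in [(10, "rack"), (60, "rack"), (100, "network"), (300, "network")]:
        print(waited, tier, decide(waited, tier, timers))
```

In this sketch, an auto-tuner in the spirit of component (iii) would adjust the `machine` and `rack` values over time rather than leave them fixed; how the paper actually tunes them is not specified in this summary.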

Paper Structure (40 sections, 2 equations, 13 figures, 3 tables, 2 algorithms)

INTRODUCTION
BACKGROUND
Delay Scheduling Based on Data Locality
Simulation Platforms
MOTIVATION
Communication Latency in DDL Clusters
Job Consolidation in DL Clusters
Improvements in SOTA Network Hardware
Limitations of Current SOTA DL Cluster Schedulers
A high-fidelity simulator for DDL scheduling research
DALLY DESIGN
Overview
System goal
Objective
Constraints
...and 25 more sections

Figures (13)

Figure 1: Single iteration training time for models consolidated on the same machine, rack, and across the network. Latency increases as the GPU workers grow physically apart.
Figure 2: A typical (hierarchical) datacenter network (n/w).
Figure 3: Scheduling scheme.
Figure 4: Auto-tuning timeline for rack-level delay timers.
Figure 5: Simulation design.
...and 8 more figures
