
GPU Cluster Scheduling for Network-Sensitive Deep Learning

Aakash Sharma, Vivek M. Bhasi, Sonali Singh, George Kesidis, Mahmut T. Kandemir, Chita R. Das

TL;DR

This work targets the high cost of distributed DL training by mitigating network-induced delays with Dally, a network-aware scheduler that combines delay scheduling, a network-sensitive preemption policy, and an auto-tuner for the delay timers. It leverages modern high-speed networks (e.g., NVSwitch, GPU RDMA) and introduces ArtISt-sim, a high-fidelity multi-job DL cluster simulator built on ASTRA-sim that models the network slowdown caused by concrete placements. In trace-driven simulations, Dally reduces makespan by up to 69%, average job completion time by up to 83%, and communication overhead by up to 98% under congested networking conditions, and it also improves tail queueing delay relative to state-of-the-art baselines. The approach supports cost-effective, scalable DL deployment in multi-tenant cloud environments and offers a practical platform for studying DL scheduling on contemporary networking hardware.

Abstract

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources according to the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Using the simulation platform, we compare against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Our scheduler improves the end-to-end makespan for training all jobs by up to 69% compared to the prevailing consolidation-based scheduling methods, while reducing the average job completion time by up to 83% and the communication overheads by up to 98% under congested networking conditions.
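
Component (i) builds on classical delay scheduling, in which a job briefly declines a less-local placement in the hope that a tighter one frees up. The following is a minimal Python sketch of that general idea applied to machine, rack, and network tiers. The names `DelayTimers` and `choose_tier`, the cumulative-timer semantics, and the example thresholds are illustrative assumptions, not Dally's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class DelayTimers:
    """Hypothetical per-job delay timers, in seconds of total queueing time."""
    machine: float  # wait this long before accepting a rack-level placement
    rack: float     # wait this long before accepting a network-wide placement


def choose_tier(gpus_needed, waited, free, timers):
    """Return the tier to place the job on now, or None to keep waiting.

    `free` maps each tier to the largest number of free GPUs reachable within
    a single machine, a single rack, and the whole cluster, respectively.
    """
    # The tightest consolidation is always accepted immediately.
    if free["machine"] >= gpus_needed:
        return "machine"
    # Otherwise hold out for a machine until the machine timer expires.
    if waited < timers.machine:
        return None
    if free["rack"] >= gpus_needed:
        return "rack"
    # Hold out for a rack until the rack timer expires.
    if waited < timers.rack:
        return None
    if free["network"] >= gpus_needed:
        return "network"
    # Not enough free GPUs anywhere: the job stays queued.
    return None


# Example with assumed thresholds and assumed free-GPU counts.
timers = DelayTimers(machine=60.0, rack=300.0)
free = {"machine": 2, "rack": 8, "network": 32}
decision = choose_tier(gpus_needed=4, waited=60.0, free=free, timers=timers)
```

In this sketch, longer timers favor consolidation at the cost of queueing delay, which is the trade-off an auto-tuner for the timers would have to balance.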

Paper Structure

This paper contains 40 sections, 2 equations, 13 figures, 3 tables, and 2 algorithms.

Figures (13)

  • Figure 1: Single iteration training time for models consolidated on the same machine, rack, and across the network. Latency increases as the GPU workers grow physically apart.
  • Figure 2: A typical (hierarchical) datacenter network (n/w).
  • Figure 3: Scheduling scheme.
  • Figure 4: Auto-tuning timeline for rack-level delay timers.
  • Figure 5: Simulation design.
  • ...and 8 more figures