Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters
Abeda Sultana, Nabin Pakka, Fei Xu, Xu Yuan, Li Chen, Nian-Feng Tzeng
TL;DR
The paper tackles the scheduling of deep learning (DL) training on heterogeneous GPU clusters by introducing Hadar, a task-level heterogeneity-aware scheduler that optimizes resource allocation across both space and time using a primal-dual optimization framework. Hadar models fine-grained performance across accelerator types, derives a dual-based scheduling method with a dynamic price function, and provides polynomial-time algorithms with competitive guarantees. Resource utilization is further improved with HadarE, which forks each job into multiple copies that run concurrently on different nodes and then aggregates and consolidates their results; proofs of maximal utilization back this design. Trace-driven and physical-cluster evaluations show that Hadar and HadarE outperform state-of-the-art schedulers (notably Gavel) in cluster resource utilization (CRU), total training time, and mean job completion time, and that HadarE also yields trained models with better inference quality. The work offers a practical, provably effective approach to scalable, efficient DL training on heterogeneous hardware in cloud and on-premise environments.
Abstract
Existing schedulers for training deep learning (DL) models on powerful clusters with accelerators such as GPUs and TPUs fall short: they either lack fine-grained heterogeneity awareness or leave resources substantially under-utilized. To fill this gap, we propose Hadar, a novel task-level heterogeneity-aware scheduler based on an optimization framework that can boost resource utilization. Hadar leverages the performance traits of DL jobs on a heterogeneous DL cluster, characterizes task-level performance heterogeneity in the optimization problem, and makes scheduling decisions across both spatial and temporal dimensions. It uses a primal-dual framework with a dual subroutine to solve the optimization problem and guide the scheduling design. Our trace-driven simulation with representative DL model training workloads demonstrates that Hadar achieves a 1.20x speed-up in total time duration over its state-of-the-art heterogeneity-aware counterpart, Gavel. We further enhance Hadar into HadarE, which forks each job into multiple copies so that the job can train concurrently on heterogeneous GPUs residing on separate available nodes (i.e., machines or servers), thereby raising resource utilization. HadarE is evaluated extensively on physical DL clusters against Hadar and Gavel. With a substantial gain in cluster resource utilization (1.45x), HadarE exhibits considerable speed-ups in DL model training, reducing the total time duration by 50% on an Amazon AWS cluster and by 80% on our lab cluster, while producing trained DL models with consistently better inference quality than those trained by Hadar.
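The abstract reports gains in two different units: a speed-up factor (1.20x) and percentage reductions in total time duration (50% and 80%). The short Python sketch below converts between the two so the figures can be compared on one scale. It assumes that a speed-up factor means baseline time divided by new time and that a reduction is the fraction of baseline time saved; the function names are illustrative and do not come from the paper.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (0.5 means 50%) to a speed-up factor.

    Assumes speed-up = baseline_time / new_time and
    reduction = 1 - new_time / baseline_time.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def reduction_from_speedup(speedup: float) -> float:
    """Convert a speed-up factor to the fraction of baseline time saved."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # Figures quoted in the abstract, converted under the assumptions above.
    for label, reduction in [("AWS cluster", 0.50), ("lab cluster", 0.80)]:
        print(f"{label}: {reduction:.0%} reduction ~ "
              f"{speedup_from_reduction(reduction):.2f}x speed-up")
    print(f"1.20x speed-up ~ {reduction_from_speedup(1.20):.1%} reduction")
```

Under these assumptions, a 50% reduction corresponds to roughly a 2x speed-up and an 80% reduction to roughly 5x, while the 1.20x simulation speed-up corresponds to about a 17% reduction in total time duration.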
