DARIS: An Oversubscribed Spatio-Temporal Scheduler for Real-Time DNN Inference on GPUs
Amir Fakhim Babaei, Thidapat Chantem
TL;DR
DARIS addresses real-time, multi-tenant DNN inference on GPUs with an oversubscribed spatio-temporal scheduler that combines NVIDIA MPS, CUDA streams, and staging-based coarse-grained preemption. It replaces the static worst-case execution time (WCET) with a dynamic, per-stage Maximum Recent Execution Time (MRET) and uses virtual deadlines to allocate time across stages, which keeps high-priority (HP) performance predictable while maximizing overall throughput. Its offline-online strategy allocates contexts, performs admission tests, and schedules stages using eight fixed priority levels, achieving zero HP deadline misses in many setups and substantial throughput gains over batching and state-of-the-art schedulers. The results show up to 15% higher throughput than batching and 11.5% higher than GSlice. Oversubscription generally improves both throughput and timeliness, and batching remains beneficial for certain networks such as InceptionV3.
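To make the MRET and virtual-deadline ideas concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary above does not state the MRET window length or the rule for splitting a deadline across stages, so the sliding-window size, the proportional split, and the names `StageMRET` and `virtual_deadlines` are all assumptions made for illustration.

```python
from collections import deque


class StageMRET:
    """Tracks the maximum of the most recent `window` execution times of one stage.

    The window length is an assumed parameter, not a value taken from DARIS.
    """

    def __init__(self, window=5):
        # A bounded deque drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def record(self, exec_time_ms):
        self.samples.append(exec_time_ms)

    def value(self):
        # With no history yet, report 0.0; a real system would likely
        # seed this from offline profiling.
        return max(self.samples, default=0.0)


def virtual_deadlines(release_ms, relative_deadline_ms, stage_mrets):
    """Split a task's relative deadline across its stages.

    Assumption: each stage receives a share of the deadline proportional
    to its MRET. Returns the absolute virtual deadline of every stage;
    the last one equals the task's absolute deadline.
    """
    total = sum(stage_mrets)
    if total <= 0:
        raise ValueError("need at least one positive MRET")
    deadlines = []
    elapsed = 0.0
    for mret in stage_mrets:
        elapsed += mret
        deadlines.append(release_ms + relative_deadline_ms * elapsed / total)
    return deadlines
```

Unlike a static WCET, the estimate in this sketch forgets an old outlier once it leaves the window, so a single slow run does not permanently inflate the time budgeted for a stage.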
Abstract
The widespread use of Deep Neural Networks (DNNs) is limited by high computational demands, especially in constrained environments. GPUs, though effective accelerators, are often underutilized and rely on coarse-grained scheduling. This paper introduces DARIS, a priority-based real-time DNN scheduler for GPUs that uses NVIDIA's MPS and CUDA streams for spatial sharing and a synchronization-based staging method for temporal partitioning. In particular, DARIS improves GPU utilization and analyzes GPU concurrency under oversubscription of computing resources. It also supports zero-delay DNN migration between GPU partitions. Experiments show that DARIS improves throughput by 15% over batching and by 11.5% over state-of-the-art schedulers, even without batching. All high-priority tasks meet their deadlines, and low-priority tasks have a deadline miss rate under 2%. High-priority response times are 33% better than those of low-priority tasks.
