ESG: Pipeline-Conscious Efficient Scheduling of DNN Workflows on Serverless Platforms with Shareable GPUs
Xinning Hui, Yuanchao Xu, Zhishan Guo, Xipeng Shen
TL;DR
ESG tackles the challenge of scheduling DNN inference on serverless platforms with shareable GPUs by treating GPUs as a first-class resource and using an optimality-guided two-step approach. ESG_1Q formulates scheduling as a path-finding problem and employs A*-search with dual-blade pruning, while dominator-based SLO distribution scales the method to large DAG-like workflows. ESG_Dispatch maps configurations to Invokers with locality-aware decisions to maximize data locality and warm starts. Empirical results show ESG delivering 61-80% higher SLO hit rates and 47-187% cost savings over prior work, with scheduling overhead under 10 ms and robustness to workload variations. The work enables cost-effective, scalable GPU sharing for real-time ML inference on serverless platforms.
Abstract
Recent years have witnessed increasing interest in machine learning inference on serverless computing for its auto-scaling and cost-effective properties. Existing serverless computing, however, lacks effective job scheduling methods to handle the schedule space dramatically expanded by GPU sharing, task batching, and inter-task relations. Prior solutions have dodged the issue by neglecting some important factors, leaving substantial performance potential untapped. This paper presents ESG, a new scheduling algorithm that directly addresses these difficulties. ESG treats shareable GPUs as a first-order factor in scheduling. It employs an optimality-guided adaptive method that combines A*-search with a novel dual-blade pruning to dramatically prune the scheduling space without compromising schedule quality. It further introduces a novel method, dominator-based SLO distribution, to ensure the scalability of the scheduler. The results show that ESG can significantly improve SLO hit rates by 61%-80% while saving 47%-187% in cost over prior work.
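To make the path-finding view concrete, the following is a minimal sketch, not the paper's ESG_1Q algorithm. It assumes a linear pipeline in which each stage has a small set of hypothetical candidate configurations (for example, different batch sizes or GPU shares), each summarized as a (latency, cost) pair. A*-search then looks for the cheapest sequence of configurations whose total latency fits within the SLO. The two bounds used here, a latency bound that discards infeasible partial paths and a cost lower bound that orders the search, are simple stand-ins chosen for illustration; the paper's dual-blade pruning may work differently. The function name `schedule` and the data layout are assumptions.

```python
import heapq
from typing import List, Optional, Tuple

# Hypothetical stage option: (latency_ms, cost) for one candidate configuration.
Config = Tuple[float, float]


def schedule(stages: List[List[Config]], slo_ms: float) -> Optional[Tuple[float, List[int]]]:
    """Return (total_cost, chosen config index per stage), or None if the SLO is infeasible."""
    n = len(stages)

    # Suffix lower bounds: the least latency and least cost still needed from stage i onward.
    min_lat = [0.0] * (n + 1)
    min_cost = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        min_lat[i] = min_lat[i + 1] + min(lat for lat, _ in stages[i])
        min_cost[i] = min_cost[i + 1] + min(cost for _, cost in stages[i])

    # Heap entries: (f = cost so far + optimistic remaining cost, cost, latency, stage, choices).
    heap = [(min_cost[0], 0.0, 0.0, 0, [])]
    while heap:
        _, cost, lat, i, choices = heapq.heappop(heap)
        if i == n:
            # The cost bound never overestimates, so the first complete path popped is cheapest.
            return cost, choices
        for k, (stage_lat, stage_cost) in enumerate(stages[i]):
            new_lat = lat + stage_lat
            # Latency bound: skip if even the fastest remaining stages would miss the SLO.
            if new_lat + min_lat[i + 1] > slo_ms:
                continue
            new_cost = cost + stage_cost
            heapq.heappush(
                heap,
                (new_cost + min_cost[i + 1], new_cost, new_lat, i + 1, choices + [k]),
            )
    return None


if __name__ == "__main__":
    # Two stages, each with a fast/expensive and a slow/cheap option (made-up numbers).
    pipeline = [[(10, 5), (20, 2)], [(15, 4), (30, 1)]]
    print(schedule(pipeline, slo_ms=35))
```

With the made-up numbers above and a 35 ms SLO, the only feasible plans are fast/fast (25 ms, cost 9) and slow/fast (35 ms, cost 6), so the search selects the second. Tightening the SLO below 25 ms makes every path infeasible and the function returns None.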
