Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
Cunchi Lv, Xiao Shi, Zhengyu Lei, Jinyue Huang, Wenting Tan, Xiaohui Zheng, Xiaofang Zhao
TL;DR
This paper addresses GPU fragmentation in serverless DL serving by introducing introspective elasticity (IE), a cross-layer approach that enables GPU resourcing-on-demand through 2D co-scaling. The authors design Dilu around three core components: multi-factor profiling (binary search for training tasks and Hybrid Growth Search for inference tasks, guided by a throughput-efficiency metric $TE$), resourcing-complementary scheduling (a 2D bin-packing-like heuristic across $SMR$ and memory), and an adaptive 2D co-scaling mechanism that combines fast vertical scaling with lazy horizontal scaling via a kernel-token system (RCKM/IL). Compared with the baselines, the evaluation reports $10\%$-$46\%$ less GPU fragmentation, up to $1.8\times$ the inference throughput and up to $1.1\times$ the training throughput, and $11\%$-$71\%$ fewer QoS violations. A Kubernetes-based prototype is publicly available. Collectively, Dilu enables GPU resourcing-on-demand for serverless DL, improves deployment density and resource utilization, and potentially lowers costs in large-scale DL serving scenarios.
Abstract
Serverless computing, with its ease of management, auto-scaling, and cost-effectiveness, is widely adopted by deep learning (DL) applications. DL workloads, especially those involving large language models, require substantial GPU resources to ensure QoS. However, serverless DL systems are prone to GPU fragmentation (e.g., 15\%-94\%) because of dynamic workloads and coarse-grained, static GPU allocation mechanisms, which gradually erodes the benefits of serverless elasticity. Unlike classical serverless systems that only scale horizontally, we present introspective elasticity (IE), a fine-grained and adaptive two-dimensional co-scaling mechanism that supports GPU resourcing-on-demand for serverless DL tasks. Based on this insight, we build Dilu, a cross-layer, GPU-based serverless DL system with IE support. First, Dilu provides multi-factor profiling for DL tasks with efficient pruning search methods. Second, Dilu follows resourcing-complementary principles in scheduling to improve GPU utilization with QoS guarantees. Third, Dilu adopts an adaptive 2D co-scaling method to enhance the elasticity of GPU provisioning in real time. Evaluations show that, compared to the SOTA baselines, Dilu can dynamically adjust the resourcing of various DL functions with low GPU fragmentation (10\%-46\% GPU defragmentation), high throughput (up to 1.8$\times$ inference and 1.1$\times$ training throughput), and QoS guarantees (11\%-71\% reduction in violation rate).
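
To make the idea of resourcing-complementary scheduling more concrete, the following is a minimal illustrative sketch in Python. It is not Dilu's actual algorithm, whose details are not given here. It assumes each task requests a fraction of a GPU's streaming multiprocessors and a fraction of its memory, and it uses a simple greedy rule: among the GPUs where a task fits, pick the one whose leftover compute and leftover memory would be most balanced. The names `Task`, `GPU`, `imbalance_after`, and `schedule` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    sm: float   # assumed: requested fraction of SMs, in [0, 1]
    mem: float  # assumed: requested fraction of GPU memory, in [0, 1]


@dataclass
class GPU:
    sm_free: float = 1.0
    mem_free: float = 1.0
    tasks: list = field(default_factory=list)

    def fits(self, t: Task) -> bool:
        # Small tolerance guards against floating-point rounding.
        eps = 1e-9
        return t.sm <= self.sm_free + eps and t.mem <= self.mem_free + eps

    def place(self, t: Task) -> None:
        self.sm_free -= t.sm
        self.mem_free -= t.mem
        self.tasks.append(t.name)


def imbalance_after(gpu: GPU, t: Task) -> float:
    # Gap between leftover SM share and leftover memory share if t lands here.
    # A small gap means the two dimensions are consumed in a complementary way.
    return abs((gpu.sm_free - t.sm) - (gpu.mem_free - t.mem))


def schedule(tasks: list) -> list:
    """Greedy 2D packing sketch: largest tasks first, most balanced fit wins."""
    gpus = []
    for t in sorted(tasks, key=lambda x: max(x.sm, x.mem), reverse=True):
        candidates = [g for g in gpus if g.fits(t)]
        if candidates:
            target = min(candidates, key=lambda g: imbalance_after(g, t))
        else:
            target = GPU()  # no existing GPU fits, so open a new one
            gpus.append(target)
        target.place(t)
    return gpus
```

Under this rule, a compute-heavy task and a memory-heavy task tend to be co-located on the same GPU, which is the intuition behind packing along both dimensions rather than allocating whole GPUs. A real scheduler would also need to account for QoS constraints and interference between co-located tasks, which this sketch ignores.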
