SGDRC: Software-Defined Dynamic Resource Control for Concurrent DNN Inference on NVIDIA GPUs
Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yunzhe Li, Zhifeng Jiang, Yang Li, Xiaowen Chu, Huaicheng Li
TL;DR
This paper introduces SGDRC, a fully software-defined system for dynamic VRAM bandwidth and compute-unit allocation to enable concurrent LS and BE DNN inference on NVIDIA GPUs. It leverages reverse engineering to map VRAM channels, trains a DNN to approximate the VRAM channel hash function, and uses software techniques like shadow page tables and cache coloring to isolate channel conflicts. SGDRC then dynamically allocates VRAM channels and SMs via bimodal tensors and tidal SM masking, guided by offline profiling, to maximize throughput while preserving LS latency. Empirical results on two GPUs show SGDRC achieves an average SLO attainment of 99.0% and improves overall throughput up to 1.47× and BE throughput up to 2.36× over state-of-the-art baselines, demonstrating practical impact for data-center GPU sharing.
Abstract
Cloud service providers heavily colocate high-priority, latency-sensitive (LS), and low-priority, best-effort (BE) DNN inference services on the same GPU to improve resource utilization in data centers. Among the critical shared GPU resources, there has been very limited analysis on the dynamic allocation of compute units and VRAM bandwidth, mainly for two reasons: (1) The native GPU resource management solutions are either hardware-specific, or unable to dynamically allocate resources to different tenants, or both; (2) NVIDIA doesn't expose interfaces for VRAM bandwidth allocation, and the software stack and VRAM channel architectures are black-box, both of which limit the software-level resource management. These drive prior work to design either conservative sharing policies detrimental to throughput, or static resource partitioning only applicable to a few GPU models. To bridge this gap, this paper proposes SGDRC, a fully software-defined dynamic VRAM bandwidth and compute unit management solution for concurrent DNN inference services. SGDRC aims at guaranteeing service quality, maximizing the overall throughput, and providing general applicability to NVIDIA GPUs. SGDRC first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs through comprehensive reverse engineering and eliminates VRAM channel conflicts using software-level cache coloring. SGDRC applies bimodal tensors and tidal SM masking to dynamically allocate VRAM bandwidth and compute units, and guides the allocation of resources based on offline profiling. We evaluate 11 mainstream DNNs with real-world workloads on two NVIDIA GPUs. The results show that compared with the state-of-the-art GPU sharing solutions, SGDRC achieves the highest SLO attainment rates (99.0% on average), and improves overall throughput by up to 1.47x and BE job throughput by up to 2.36x.
