
SGDRC: Software-Defined Dynamic Resource Control for Concurrent DNN Inference on NVIDIA GPUs

Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yunzhe Li, Zhifeng Jiang, Yang Li, Xiaowen Chu, Huaicheng Li

TL;DR

This paper introduces SGDRC, a fully software-defined system that dynamically allocates VRAM bandwidth and compute units so that latency-sensitive (LS) and best-effort (BE) DNN inference can run concurrently on NVIDIA GPUs. It reverse-engineers the mapping of addresses to VRAM channels, trains a DNN to approximate the VRAM channel hash function, and uses software techniques such as shadow page tables and cache coloring to eliminate channel conflicts. SGDRC then dynamically allocates VRAM channels and SMs via bimodal tensors and tidal SM masking, guided by offline profiling, to maximize throughput while preserving LS latency. Empirical results on two GPUs show that SGDRC achieves an average SLO attainment of 99.0% and improves overall throughput by up to 1.47× and BE throughput by up to 2.36× over state-of-the-art baselines.
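The SLO attainment figure quoted above can be read as the fraction of LS requests that finish within their latency target. The sketch below illustrates that reading only; the function name `slo_attainment`, the millisecond units, and the sample latencies are assumptions made for illustration, and the paper's own definition (given in its experimental setup) may differ in detail.

```python
def slo_attainment(latencies_ms, slo_ms):
    """Fraction of requests whose latency is within the SLO target.

    Hypothetical helper: the name, units, and "<=" convention are
    illustrative assumptions, not taken from the paper.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    met = sum(1 for latency in latencies_ms if latency <= slo_ms)
    return met / len(latencies_ms)


# Made-up sample: three of four requests meet a 10 ms target.
rate = slo_attainment([8.0, 9.0, 12.0, 10.0], slo_ms=10.0)
```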

Abstract

Cloud service providers heavily colocate high-priority, latency-sensitive (LS), and low-priority, best-effort (BE) DNN inference services on the same GPU to improve resource utilization in data centers. Among the critical shared GPU resources, there has been very limited analysis on the dynamic allocation of compute units and VRAM bandwidth, mainly for two reasons: (1) The native GPU resource management solutions are either hardware-specific, or unable to dynamically allocate resources to different tenants, or both; (2) NVIDIA doesn't expose interfaces for VRAM bandwidth allocation, and the software stack and VRAM channel architectures are black-box, both of which limit the software-level resource management. These drive prior work to design either conservative sharing policies detrimental to throughput, or static resource partitioning only applicable to a few GPU models. To bridge this gap, this paper proposes SGDRC, a fully software-defined dynamic VRAM bandwidth and compute unit management solution for concurrent DNN inference services. SGDRC aims at guaranteeing service quality, maximizing the overall throughput, and providing general applicability to NVIDIA GPUs. SGDRC first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs through comprehensive reverse engineering and eliminates VRAM channel conflicts using software-level cache coloring. SGDRC applies bimodal tensors and tidal SM masking to dynamically allocate VRAM bandwidth and compute units, and guides the allocation of resources based on offline profiling. We evaluate 11 mainstream DNNs with real-world workloads on two NVIDIA GPUs. The results show that compared with the state-of-the-art GPU sharing solutions, SGDRC achieves the highest SLO attainment rates (99.0% on average), and improves overall throughput by up to 1.47x and BE job throughput by up to 2.36x.
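The idea of coloring memory by VRAM channel can be sketched in a few lines. The example below is a toy model under stated assumptions: `toy_channel_hash` is a made-up XOR-fold standing in for the undisclosed hardware hash that the paper reverse-engineers, and `color_pages`, the page numbers, and the four-channel layout are all hypothetical. It only shows how pages could be sorted into per-tenant pools once a page-to-channel mapping is known, so that LS and BE tenants never touch the same channel.

```python
def toy_channel_hash(page, channel_bits=2):
    """Hypothetical stand-in for the GPU's page-to-channel hash.

    XOR-folds the page number into `channel_bits` bits. The real NVIDIA
    mapping is not public; this function is an illustrative assumption.
    """
    mask = (1 << channel_bits) - 1
    channel = 0
    while page:
        channel ^= page & mask
        page >>= channel_bits
    return channel


def color_pages(pages, ls_channels, channel_bits=2):
    """Split pages into LS and BE pools by the channel each page maps to."""
    pools = {"LS": [], "BE": []}
    for page in pages:
        channel = toy_channel_hash(page, channel_bits)
        tenant = "LS" if channel in ls_channels else "BE"
        pools[tenant].append(page)
    return pools


# Reserve channel 0 of 4 for the LS tenant; the rest go to BE.
pools = color_pages(range(16), ls_channels={0})
```

In this toy layout the LS pool receives only pages that hash to channel 0, so BE traffic cannot contend for that channel's bandwidth. Changing `ls_channels` at run time would model a reallocation of bandwidth between tenants, although the paper's actual mechanism (bimodal tensors) is not reproduced here.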

Paper Structure (31 sections, 19 figures, 4 tables, 3 algorithms)

Figures (19)

  • Figure 1: Illustration of existing GPU sharing schemes and SGDRC. The gray (or colored) rectangles represent GPU resources (or DNN kernels). The width (or height) of a colored rectangle represents the runtime (or resource utilization) of a DNN kernel.
  • Figure 2: NVIDIA GPU's architecture and the software stack.
  • Figure 3: Resource contention in GPU sharing. We measure the p99 latency of the victim task to quantify the interference. L1C (or Comp.) in (a) denotes the introduction of L1 cache (or compute unit) interference. Testbed: RTX A2000.
  • Figure 4: Limitations of GPU temporal and spatial multiplexing. (a) Temporal multiplexing [Wu2023, Gujarati2020] cannot achieve high throughput for BE tasks; (b) Spatial multiplexing [Zhao2023] can achieve high throughput, but at the cost of the LS task's SLO attainment rate (defined in the paper's experimental setup section) due to resource contention. LS workload: MobileNet V3; BE workload: DenseNet161; Testbed: RTX A2000.
  • Figure 5: Interference-aware multiplexing is not a panacea. (a) As the load increases, the LS service maintains a high SLO attainment rate, but the throughput of the BE task declines substantially. LS workload: MobileNet V3; BE workload: DenseNet161; Testbed: RTX A2000. (b) Analysis of scheduling constraints of BE tasks (models I to K in the paper's table of testing models, running on RTX A2000). Res.: Constraints on SM or VRAM bandwidth utilization; SM: Constraints on the required number of SMs; Runtime: Constraints on kernel runtime.
  • ...and 14 more figures