ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments

Munkyu Lee; Sihoon Seong; Minki Kang; Jihyuk Lee; Gap-Joo Na; In-Geol Chun; Dimitrios Nikolopoulos; Cheol-Ho Hong

TL;DR

This paper proposes ParvaGPU, a technology for spatial GPU sharing in large-scale DNN inference in cloud computing. It targets two problems that arise in combined MIG and MPS environments: underutilization within allocated GPU space partitions and external fragmentation.

Abstract

In cloud environments, GPU-based deep neural network (DNN) inference servers are required to meet the Service Level Objective (SLO) latency for each workload under a specified request rate, while also minimizing GPU resource consumption. However, previous studies have not fully achieved this objective. In this paper, we propose ParvaGPU, a technology that facilitates spatial GPU sharing for large-scale DNN inference in cloud computing. ParvaGPU integrates NVIDIA's Multi-Instance GPU (MIG) and Multi-Process Service (MPS) technologies to enhance GPU utilization, with the goal of meeting the diverse SLOs of each workload and reducing overall GPU usage. Specifically, ParvaGPU addresses the challenges of minimizing underutilization within allocated GPU space partitions and external fragmentation in combined MIG and MPS environments. We conducted our assessment on multiple A100 GPUs, evaluating 11 diverse DNN workloads with varying SLOs. Our evaluation revealed no SLO violations and a significant reduction in GPU usage compared to state-of-the-art frameworks.

Paper Structure (29 sections, 4 equations, 10 figures, 4 tables, 2 algorithms)

Introduction
Background and Related Work
Multi-Process Service (MPS)
Multi-Instance GPU (MIG)
Design
Overall Design
Workload Characteristic Analysis
Profiler
GPU Segment Configurator
Optimal Triplet Decision
Demand Matching
GPU Segment Allocator
Segment Relocation
Allocation Optimization
Deployment
...and 14 more sections

Figures (10)

Figure 1: Supported MIG configurations on the NVIDIA A100 GPU.
Figure 2: Overall design of ParvaGPU.
Figure 3: Throughput (requests/s) of InceptionV3 with different batch sizes and instance sizes for each process count of 1 (a), 2 (b), and 3 (c).
Figure 5: Total number of GPUs of each baseline and ParvaGPU.
Figure 6: Internal slack rate of each baseline and ParvaGPU.
...and 5 more figures
