PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers

Gwangoo Yeo, Jiin Kim, Yujeong Choi, Minsoo Rhu

TL;DR

The paper addresses underutilization and high latency in MIG-based AI inference servers, which it traces to two causes: CPU-side data preprocessing bottlenecks and inefficient batching. It proposes PREBA, a hardware/software co-design that offloads preprocessing to an FPGA-based DPU and adds a dynamic batching system tailored to MIG's vGPU partitioning and driven by two parameters, Batch_knee and Time_knee. In an end-to-end evaluation on real hardware across six AI workloads, PREBA delivers on average approximately 3.7x higher throughput, 3.4x lower tail latency, 3.5x better energy efficiency, and 3.0x better cost efficiency, closely approaching an ideal preprocessing-free baseline. The work shows the practical value of MIG-aware preprocessing acceleration and dynamic batching for latency-critical AI inference, with implications for MIG-enabled AIaaS deployments and TCO optimization.

Abstract

NVIDIA's Multi-Instance GPU (MIG) is a feature that enables system designers to reconfigure one large GPU into multiple smaller GPU slices. This work characterizes this emerging GPU and evaluates its effectiveness in designing high-performance AI inference servers. Our study reveals that the data preprocessing stage of AI inference is a significant performance bottleneck for MIG. To this end, we present PREBA, a hardware/software co-design targeting MIG inference servers. Our first proposition is an FPGA-based data preprocessing accelerator that unlocks the full potential of MIG through domain-specific acceleration of data preprocessing. The MIG inference server, freed from preprocessing overheads, is then augmented with our dynamic batching system that enables high-performance inference. PREBA is implemented end-to-end in real systems, providing a 3.7x improvement in throughput, 3.4x reduction in tail latency, 3.5x improvement in energy-efficiency, and 3.0x improvement in cost-efficiency.

Paper Structure

This paper contains 21 sections, 22 figures, 1 table.

Figures (22)

  • Figure 1: Overview of NVIDIA's GPU architecture.
  • Figure 2: MIG partitioning options in NVIDIA A100 GPU.
  • Figure 3: End-to-end AI inference pipeline.
  • Figure 4: Data preprocessing operations for (a) computer vision and (b) audio processing.
  • Figure 5: (Bar chart) Model execution throughput and (line chart) its GPU utilization when preprocessing is disabled. The x-axis shows the input batch size executed by a single vGPU.
  • ...and 17 more figures