Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration

Tianyu Wang, Sheng Li, Bingyao Li, Yue Dai, Ao Li, Geng Yuan, Yufei Ding, Youtao Zhang, Xulong Tang

TL;DR

This paper addresses the challenge of running continuous learning workloads with co-located retraining and inference on cloud GPUs, where both timely model updates and strict inference SLOs must be met. It introduces MIGRator, a dynamic GPU reconfiguration runtime built on NVIDIA MIG, which casts resource allocation as an ILP problem and optimizes a Goodput-based objective that jointly accounts for inference timeliness and post-retraining accuracy. The approach enables fine-grained (per-second) reconfiguration, eliminates cross-tenant interference via MIG isolation, and uses a pre-initialization technique to reduce reconfiguration overhead. Evaluated against state-of-the-art GPU sharing methods on six DNN models, real inference traces, and three CL retraining datasets, MIGRator delivers up to 21% higher Goodput and up to 83% lower reconfiguration overhead, demonstrating practical value for cloud ML deployments with dynamic workload mixes.

Abstract

Continuous learning (CL) has emerged as one of the most popular deep learning paradigms deployed in modern cloud GPUs. Specifically, CL continuously updates the model parameters (through model retraining) and uses the updated model (if available) to serve inference requests that arrive over time. It is generally beneficial to co-locate retraining and inference to enable timely model updates and avoid model transfer overheads. This creates the need for GPU sharing between retraining and inference. Meanwhile, multiple CL workloads can share the modern GPUs in the cloud, leading to multi-tenancy execution. In this paper, we observe that prior GPU-sharing techniques are not optimized for multi-tenancy CL workloads. Specifically, they do not coherently consider the accuracy of the retrained model and the inference service level objective (SLO) attainment. Moreover, they cannot accommodate the dynamics that arise over time (e.g., inference arrival intensity) in CL execution. We propose MIGRator, a novel GPU reconfiguration runtime that dynamically performs GPU reconfiguration for multi-tenancy CL workloads. MIGRator builds on the recent NVIDIA multi-instance GPU (MIG) to mitigate resource contention and formulates the reconfiguration optimization as an Integer Linear Programming (ILP) problem to dynamically identify, reconfigure, and allocate the GPU instances. MIGRator leverages the "Goodput" metric in the ILP objective function to consider both inference SLO attainment and model accuracy in the reconfiguration exploration. We evaluate MIGRator using representative multi-tenancy CL workloads. The results show our approach outperforms the state-of-the-art GPU sharing techniques (i.e., Ekya, Astraea, and PARIS) by 17%, 21%, and 20%, respectively.
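To make the trade-off in the abstract concrete, the following is a minimal sketch, not MIGRator's actual formulation. It assumes a single tenant, 7 GPCs to split between inference and retraining (matching the '6-1' style allocations in the figure captions), and a toy Goodput defined as requests served within the SLO multiplied by model accuracy. The throughput figure, the accuracy table, and the names `toy_goodput` and `best_split` are all hypothetical. It also swaps the paper's ILP solver for plain brute-force enumeration, which is only practical at this toy scale.

```python
from typing import Dict, Tuple

TOTAL_GPCS = 7  # assumption: 7 GPCs are split between inference and retraining

# Hypothetical accuracy reached when retraining gets this many GPCs.
TOY_ACCURACY: Dict[int, float] = {1: 0.70, 2: 0.74, 3: 0.77, 4: 0.79, 5: 0.80, 6: 0.81}


def toy_goodput(infer_gpcs: int, retrain_gpcs: int, arrival_rate: int,
                per_gpc_throughput: int, accuracy: Dict[int, float]) -> float:
    """Toy Goodput: requests served within the SLO, weighted by accuracy."""
    # Assumption: each inference GPC serves a fixed number of requests per
    # second within the SLO; anything beyond that capacity misses the SLO.
    served_in_slo = min(arrival_rate, infer_gpcs * per_gpc_throughput)
    return served_in_slo * accuracy[retrain_gpcs]


def best_split(arrival_rate: int, per_gpc_throughput: int,
               accuracy: Dict[int, float]) -> Tuple[int, int, float]:
    """Enumerate every inference/retraining split and keep the best one."""
    best: Tuple[int, int, float] = (0, 0, float("-inf"))
    for infer_gpcs in range(1, TOTAL_GPCS):
        retrain_gpcs = TOTAL_GPCS - infer_gpcs
        goodput = toy_goodput(infer_gpcs, retrain_gpcs, arrival_rate,
                              per_gpc_throughput, accuracy)
        if goodput > best[2]:
            best = (infer_gpcs, retrain_gpcs, goodput)
    return best


if __name__ == "__main__":
    # Light load: spare GPCs go to retraining.
    print(best_split(300, 100, TOY_ACCURACY)[:2])  # (3, 4)
    # Heavy load: GPCs shift to inference to protect SLO attainment.
    print(best_split(600, 100, TOY_ACCURACY)[:2])  # (6, 1)
```

Even in this toy setting, the best split moves as the arrival rate changes, which is the motivation for reconfiguring dynamically rather than fixing one partition. A real formulation would also have to respect the valid MIG instance sizes, multiple tenants, and reconfiguration overhead, none of which are modeled here.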

Paper Structure (18 sections, 12 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: MIG instance configurations on NVIDIA A100.
  • Figure 2: SLO attainment (%) and inference accuracy (%) of ResNet50 under various GPC allocations in four retraining windows. The leftmost column lists GPC allocations, where '6-1' indicates that 6 GPCs are allocated to the inference task and 1 GPC is allocated to the retraining task.
  • Figure 3: Inference requests per second for the Azure and Alibaba traces within a single retraining window. The first row of the table lists GPC allocations, where '1-6' indicates that Inception is allocated 1 GPC while ResNet50 is allocated 6 GPCs.
  • Figure 4: Inference accuracy (%) under different GPC allocations for retraining tasks in each retraining window. The leftmost column of the table lists GPC allocations, where '1-4' indicates that the Inception retraining task is allocated a 1-GPC instance and the ResNet50 retraining task is allocated a 4-GPC instance. The highest inference accuracy for each retraining window is highlighted.
  • Figure 5: Reconfiguration overhead for ConvNeXt, ResNet50 and MobileNet.
  • ...and 5 more figures