On the Partitioning of GPU Power among Multi-Instances

Tirth Vamja; Kaustabha Ray; Felix George; UmaMaheswari C Devi

TL;DR

This work tackles the problem of partitioning a GPU's power among its MIG instances in data centers. It develops ML-based offline and online estimators that use DCGM metrics to predict partition-level power on NVIDIA A100 GPUs, and analyzes CUDA and Tensor workloads on A100 and V100 GPUs to expose workload and hardware heterogeneity. The study finds that no single offline full-GPU model generalizes across diverse workloads, and it demonstrates online MIG-level predictors whose estimates are scaled so that their aggregate aligns with the measured GPU power, supporting fair and transparent carbon reporting. The results show that online, workload-aware power attribution improves accuracy for partitioned workloads such as matrix multiplication and LLM inference, with practical implications for power accounting and sustainability in cloud environments.
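The TL;DR mentions scaling per-partition estimates so that their aggregate matches the measured GPU power. As a minimal sketch, assuming a simple proportional correction (the exact scaling rule is not stated here), each MIG instance's estimate can be multiplied by the ratio of measured to estimated total power. The function name `scale_to_measured`, the instance labels, and the equal-split fallback when every estimate is zero are hypothetical choices for illustration.

```python
from typing import Dict


def scale_to_measured(estimates: Dict[str, float], measured_total_w: float) -> Dict[str, float]:
    """Rescale per-partition power estimates (W) so they sum to the measured GPU power (W).

    Assumption: a proportional (multiplicative) correction; each partition keeps
    its share of the summed estimate. This is an illustrative rule, not the paper's.
    """
    if measured_total_w < 0:
        raise ValueError("measured power must be non-negative")
    total_est = sum(estimates.values())
    if total_est <= 0:
        # No usable signal from the predictors: fall back to an equal split (hypothetical policy).
        n = len(estimates)
        return {k: measured_total_w / n for k in estimates} if n else {}
    factor = measured_total_w / total_est
    return {k: v * factor for k, v in estimates.items()}


if __name__ == "__main__":
    # Hypothetical per-MIG-instance estimates (watts) from an online predictor.
    est = {"3g.20gb-0": 120.0, "2g.10gb-1": 60.0, "1g.5gb-2": 20.0}
    scaled = scale_to_measured(est, measured_total_w=250.0)
    for name, watts in scaled.items():
        print(f"{name}: {watts:.1f} W")
```

Because the correction is multiplicative, every partition keeps its relative share, and any static or idle power contained in the measured total is distributed in the same proportions; other attribution policies are possible.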

Abstract

Efficient power management in cloud data centers is essential for reducing costs, enhancing performance, and minimizing environmental impact. GPUs, critical for tasks such as machine learning (ML) and GenAI, are major contributors to power consumption. NVIDIA's Multi-Instance GPU (MIG) technology improves GPU utilization by enabling isolated partitions with per-partition resource tracking, facilitating GPU sharing by multiple tenants. However, accurately apportioning GPU power consumption among MIG instances remains challenging due to a lack of hardware support. This paper addresses this challenge by developing software methods to estimate power usage per MIG partition. We analyze NVIDIA GPU utilization metrics and find that lightweight methods with good accuracy can be difficult to construct. We therefore explore the use of ML-based power models to enable accurate, partition-level power estimation. Our findings reveal that no single generic offline power model or modeling method applies across diverse workloads, especially with concurrent MIG usage, and that online models constructed from the partition-level utilization metrics of the workloads under execution can significantly improve accuracy. Using NVIDIA A100 GPUs, we demonstrate accurate partition-level power estimation with this approach for workloads including matrix multiplication and Large Language Model (LLM) inference, contributing to transparent and fair carbon reporting.
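The abstract describes power models built from partition-level utilization metrics. The paper's models are ML-based and their exact form is not given here, so the sketch below uses a plain ordinary least-squares linear regression as a stand-in: it fits power against utilization features and then predicts power for new readings. The two features (SM and DRAM activity fractions, loosely modeled on DCGM profiling metrics), the synthetic training data, and the helper names `fit_linear_power_model` and `predict` are assumptions for illustration. Only the standard library is used; the normal equations are solved with a small Gaussian elimination.

```python
from typing import List, Sequence


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features collinear or too few samples")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear_power_model(features: Sequence[Sequence[float]],
                           power_w: Sequence[float]) -> List[float]:
    """OLS fit of power ~ w0 + sum_j w_j * feature_j by solving X^T X w = X^T y.

    Assumption: a linear model stands in for the paper's ML-based estimators.
    """
    rows = [[1.0, *f] for f in features]  # prepend intercept column
    if not rows:
        raise ValueError("need at least one sample")
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, power_w)) for i in range(k)]
    return _solve(xtx, xty)


def predict(weights: Sequence[float], feature_row: Sequence[float]) -> float:
    """Predicted power (W) for one row of utilization features."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_row))


if __name__ == "__main__":
    # Hypothetical samples: (SM activity, DRAM activity) fractions paired with power
    # in watts, synthesized from 60 + 150*sm + 40*dram purely for illustration.
    samples = [(0.1, 0.2), (0.5, 0.1), (0.9, 0.6), (0.3, 0.8), (0.7, 0.4)]
    power = [60 + 150 * s + 40 * d for s, d in samples]
    w = fit_linear_power_model(samples, power)
    print([round(v, 3) for v in w])
    print(round(predict(w, (0.5, 0.5)), 1))
```

In an online setting, a model like this would be refit as new samples arrive from the workload currently running in each partition; the linear form keeps refitting cheap but may miss nonlinear effects that richer ML models can capture.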
