Measuring GPU utilization one level deeper

Paul Elvinger, Foteini Strati, Natalie Enright Jerger, Ana Klimovic

TL;DR

The paper tackles GPU underutilization and the unpredictable interference that arises when workloads are colocated on the same device. It proposes a measurement-driven methodology that uses microbenchmarks to profile interference across inter-SM and intra-SM resources, then applies those insights to real ML kernels. The key contributions are identifying specific interference channels (block scheduling, L1/L2 caches, memory bandwidth, IPC, and pipelines), demonstrating multi-dimensional interference with microbenchmarks and real ML kernels, and outlining an interference-aware scheduler design and hardware requirements. The practical aim is higher GPU utilization with predictable performance, achieved by moving beyond coarse utilization metrics toward finer-grained, principled scheduling decisions.

Abstract

GPU hardware is vastly underutilized. Even resource-intensive AI applications have diverse resource profiles that often leave parts of GPUs idle. While colocating applications can improve utilization, current spatial sharing systems lack performance guarantees. Providing predictable performance guarantees requires a deep understanding of how applications contend for shared GPU resources such as block schedulers, compute units, L1/L2 caches, and memory bandwidth. We propose a methodology to profile resource interference of GPU kernels across these dimensions and discuss how to build GPU schedulers that provide strict performance guarantees while colocating applications to minimize cost.

Paper Structure

This paper contains 17 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Simplified diagram of an NVIDIA GPU (based on an H100), focusing on a Streaming Multiprocessor.
  • Figure 2: L2 cache interference on an H100.
  • Figure 3: L1 cache interference on an H100.