Wattchmen: Watching the Wattchers -- High Fidelity, Flexible GPU Energy Modeling

Brandon Tran, Matthias Maiterth, Woong Shin, Matthew D. Sinclair, Shivaram Venkataraman

Abstract

Modern GPU-rich HPC systems are increasingly energy-constrained, so understanding an application's energy consumption becomes essential. Unfortunately, current GPU energy attribution techniques are inaccurate, inflexible, or outdated. Therefore, we propose Wattchmen, a flexible methodology for measuring, attributing, and predicting GPU energy consumption. We construct a per-instruction energy model using a diverse set of microbenchmarks to systematically quantify the energy consumption of GPU instructions, enabling finer-grained prediction and energy consumption breakdowns for applications. Compared with state-of-the-art systems such as AccelWattch (32% MAPE) and Guser (25% MAPE), across 16 popular GPGPU, graph analytics, HPC, and ML workloads, Wattchmen reduces the mean absolute percent error (MAPE) to 14% on V100 GPUs. Furthermore, we show that Wattchmen provides similar MAPEs for water-cooled V100s (15%) and extends to later architectures, including air-cooled A100 (11%) and H100 (12%) GPUs. Finally, to further demonstrate Wattchmen's value, we apply it to applications such as Backprop and QMCPACK, where Wattchmen's insights enable energy reductions of up to 35%.
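The per-instruction energy model described above (and in Figure 3) amounts to solving a linear system: each microbenchmark contributes one equation whose coefficients are the dynamic counts of each instruction type and whose right-hand side is the measured energy. A minimal sketch of that idea follows; all instruction names, counts, and energy values here are made-up illustrations, not numbers from the paper.

```python
# Hypothetical illustration of a Wattchmen-style per-instruction energy model:
# each microbenchmark yields (instruction counts, measured energy), and solving
# the resulting linear system recovers a per-instruction energy cost.
# All counts and energies below are invented for illustration.

def solve_linear_system(A, b):
    """Gaussian elimination with partial pivoting for a square system A x = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Rows: microbenchmarks; columns: dynamic counts of [FADD, FMUL, LDG] (hypothetical).
counts = [
    [1_000_000,       0,       0],  # pure-add microbenchmark
    [  500_000, 500_000,       0],  # mixed add/mul
    [  250_000, 250_000, 500_000],  # add/mul plus global loads
]
# Measured dynamic energy per run, in joules (hypothetical).
energy = [2.0, 2.5, 6.0]

per_instr = solve_linear_system(counts, energy)
for name, joules in zip(["FADD", "FMUL", "LDG"], per_instr):
    print(f"{name}: {joules * 1e9:.1f} nJ/instruction")
```

In the paper's actual model the system is far larger (the V100 table spans 90 microbenchmarks covering 90 instructions), but the shape of the computation is the same: instruction counts form the coefficient matrix and measured energies form the right-hand side.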

Paper Structure

This paper contains 29 sections, 3 equations, 14 figures, and 7 tables.

Figures (14)

  • Figure 1: Comparing AccelWattch's energy predictions to measurements from air-cooled Tesla V100 GPUs across various benchmarks, which leads to a 32% MAPE. The blue line indicates perfect prediction.
  • Figure 2: Wattchmen design overview.
  • Figure 3: Subset of the full system of equations used to solve for instructions for the air-cooled V100 GPU. Each row represents one microbenchmark, and each column represents the frequency of a target instruction occurring in the benchmark (selection). The full table for the V100 GPU includes 90 microbenchmarks covering 90 instructions.
  • Figure 4: Power trace sampled with NVML from running a double precision addition microbenchmark on an air-cooled Tesla V100 GPU, including GPU utilization (red) and GPU power (blue).
  • Figure 5: Simple microbenchmark that loops two different instructions. Base: 2mul and 2add; Additional Mul: 4mul and 4add; 2x Base: 4mul + 4add.
  • ...and 9 more figures