Empirical Measurements of AI Training Power Demand on a GPU-Accelerated Node

Imran Latif, Alex C. Newkirk, Matthew R. Carbone, Arslan Munir, Yuewei Lin, Jonathan Koomey, Xi Yu, Zhihua Dong

TL;DR

The paper quantifies the energy footprint of AI training through empirical, node-level power measurements on an 8-GPU NVIDIA H100 HGX node running ResNet and Llama2-13b workloads, complemented by GPU-burn tests that bound the power envelope. The authors show that the observed peak power ($8.48$ kW) is substantially below the manufacturer rating ($10.2$ kW), and that increasing the ResNet batch size from $512$ to $4096$ involves a significant energy trade-off: roughly $4\times$ less total energy, at an average power about $1$ kW higher. The work provides concrete node-level data to improve energy-use models and capacity planning for data centers, with implications for sustainability assessments and infrastructure provisioning. It also outlines future directions, including the impact of cooling technologies and carbon-aware scheduling on AI workloads.

Abstract

The expansion of artificial intelligence (AI) applications has driven substantial investment in computational infrastructure, especially by cloud computing providers. Quantifying the energy footprint of this infrastructure requires models parameterized by the power demand of AI hardware during training. We empirically measured the instantaneous power draw of an 8-GPU NVIDIA H100 HGX node during the training of an open-source image classifier (ResNet) and a large language model (Llama2-13b). The maximum observed power draw was approximately 8.4 kW, 18% lower than the manufacturer-rated 10.2 kW, even with GPUs near full utilization. Holding model architecture constant, increasing batch size from 512 to 4096 images for ResNet reduced total training energy consumption by a factor of 4. These findings can inform capacity planning for data center operators and energy use estimates by researchers. Future work will investigate the impact of cooling technology and carbon-aware scheduling on AI workload energy consumption.
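
As a quick check of the headline figure, the "18% lower" claim follows directly from the two power values quoted in the abstract. The sketch below is only illustrative arithmetic; the constant and function names are hypothetical and do not come from the paper.

```python
# Values quoted in the abstract (kW). Names are illustrative, not from the paper.
RATED_KW = 10.2          # manufacturer-rated node power
OBSERVED_PEAK_KW = 8.4   # approximate maximum observed power draw


def fraction_below_rating(observed_kw: float, rated_kw: float) -> float:
    """Return how far the observed power sits below the rating, as a fraction."""
    if rated_kw <= 0:
        raise ValueError("rated power must be positive")
    return (rated_kw - observed_kw) / rated_kw


derating = fraction_below_rating(OBSERVED_PEAK_KW, RATED_KW)
print(f"Observed peak is {derating:.1%} below the rated power")
```

The result, about 17.6%, rounds to the 18% stated in the abstract.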

Paper Structure

This paper contains 17 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Instantaneous pre-training power demand measured in kW. Data correspond to training ResNet with batch sizes of 512 and 4096 images. Total energy usage for the two cases was computed by integrating the curves and is 123 and 30 kWh, respectively.
  • Figure 2: Instantaneous pre-training power demand measured in kW, and average total GPU load, for the Llama2-13b model. The rated power of the system (10.2 kW) is shown as a horizontal dashed line. The median node power draw from the GPU+CPU burn control experiment (8.43 kW) is shown as a horizontal dotted line.
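
The Figure 1 caption states that total energy was obtained by integrating the power curves. A minimal sketch of that step is shown below, assuming timestamped power samples and trapezoidal integration; the paper does not specify its integration method or sampling interval, so `energy_kwh` and the synthetic trace are hypothetical and only illustrate the idea.

```python
from typing import Sequence


def energy_kwh(times_s: Sequence[float], power_kw: Sequence[float]) -> float:
    """Integrate power samples (kW) over time (s) with the trapezoidal rule.

    Returns energy in kWh. Assumes timestamps are strictly increasing.
    """
    if len(times_s) != len(power_kw):
        raise ValueError("times_s and power_kw must have the same length")
    total_kw_seconds = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # Area of one trapezoid: mean of the two samples times the interval.
        total_kw_seconds += 0.5 * (power_kw[i] + power_kw[i - 1]) * dt
    return total_kw_seconds / 3600.0  # kW*s -> kWh


# Synthetic trace (NOT measured data): a constant 6 kW for 2 hours, sampled every 60 s.
times = [60.0 * i for i in range(121)]
power = [6.0] * len(times)
synthetic_energy = energy_kwh(times, power)

# Energy ratio between the two ResNet runs, using the totals from the Figure 1 caption.
energy_ratio = 123 / 30
print(f"synthetic energy: {synthetic_energy:.1f} kWh, batch-size energy ratio: {energy_ratio:.1f}x")
```

With the caption's totals of 123 kWh and 30 kWh, the ratio is about 4.1, consistent with the factor-of-4 reduction reported in the abstract.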