Empirical Measurements of AI Training Power Demand on a GPU-Accelerated Node

Imran Latif; Alex C. Newkirk; Matthew R. Carbone; Arslan Munir; Yuewei Lin; Jonathan Koomey; Xi Yu; Zhiuha Dong

TL;DR

The paper quantifies the energy footprint of AI training through empirical, node-level power measurements on an 8-GPU NVIDIA H100 HGX node running ResNet and Llama2-13b workloads, complemented by GPU-burn tests that bound the power envelope. The observed peak power ($8.48$ kW) is substantially below the manufacturer rating ($10.2$ kW). Increasing the ResNet batch size from $512$ to $4096$ produces a significant energy trade-off: roughly $4\times$ less total training energy at about $1$ kW higher average power. The work provides concrete node-level data to improve energy-use models and capacity planning for data centers, with implications for sustainability assessments and infrastructure provisioning. It also outlines future directions, including the impact of cooling technologies and carbon-aware scheduling on AI workloads.

Abstract

The expansion of artificial intelligence (AI) applications has driven substantial investment in computational infrastructure, especially by cloud computing providers. Quantifying the energy footprint of this infrastructure requires models parameterized by the power demand of AI hardware during training. We empirically measured the instantaneous power draw of an 8-GPU NVIDIA H100 HGX node during the training of an open-source image classifier (ResNet) and a large language model (Llama2-13b). The maximum observed power draw was approximately 8.4 kW, 18% lower than the manufacturer-rated 10.2 kW, even with GPUs near full utilization. Holding model architecture constant, increasing the batch size from 512 to 4096 images for ResNet reduced total training energy consumption by a factor of 4. These findings can inform capacity planning for data center operators and energy use estimates by researchers. Future work will investigate the impact of cooling technology and carbon-aware scheduling on AI workload energy consumption.
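
The arithmetic behind the abstract's headline figures can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names `headroom_fraction` and `energy_kwh` are hypothetical, and the trapezoidal integration is one common way to turn sampled power readings into energy, assumed here rather than taken from the paper.

```python
def headroom_fraction(observed_kw: float, rated_kw: float) -> float:
    """Fraction by which the observed peak power falls below the rated power."""
    return (rated_kw - observed_kw) / rated_kw


def energy_kwh(timestamps_s, power_kw):
    """Integrate sampled power (kW) over time (s) with the trapezoidal rule.

    Returns energy in kWh. Assumes timestamps are sorted in ascending order.
    """
    total_kw_seconds = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total_kw_seconds += 0.5 * (power_kw[i] + power_kw[i - 1]) * dt
    return total_kw_seconds / 3600.0


# Figures quoted in the abstract: ~8.4 kW observed vs. 10.2 kW rated.
print(round(100 * headroom_fraction(8.4, 10.2), 1))  # 17.6, i.e. roughly 18%
```

Because energy is average power multiplied by duration, a run can draw more power on average and still consume less total energy if it finishes sufficiently sooner, which is the trade-off reported for the larger ResNet batch size.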
