Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale

Dan Zhao, Siddharth Samsi, Joseph McDonald, Baolin Li, David Bestor, Michael Jones, Devesh Tiwari, Vijay Gadepally

TL;DR

This work addresses the sustainability of AI compute by empirically studying GPU power capping at HPC scale on the MIT SuperCloud. Using a detailed, large-scale dataset and causal inference methods, it quantifies how power caps reduce GPU temperature and energy use, and it assesses the potential and limits of system-wide energy savings given how users and the scheduler respond. The study provides statistically significant evidence that power capping yields meaningful per-job energy and thermal reductions, with effects on performance that depend on cap strength and workload efficiency. These findings offer actionable guidance for HPC/datacenter operators aiming to balance AI acceleration needs with sustainability and hardware longevity.

Abstract

As research and deployment of AI grow, the computational burden to support and sustain its progress inevitably grows too. To train or fine-tune state-of-the-art models in NLP, computer vision, etc., some form of AI hardware acceleration is virtually a requirement. Recent large language models require considerable resources to train and deploy, resulting in significant energy usage, potential carbon emissions, and massive demand for GPUs and other hardware accelerators. This surge carries large implications for energy sustainability at the HPC/datacenter level. In this paper, we study the aggregate effect of power-capping GPUs on GPU temperature and power draw at a research supercomputing center. With the right amount of power-capping, we show significant decreases in both temperature and power draw, reducing power consumption and potentially improving hardware lifespan with minimal impact on job performance. While power-capping reduces power draw by design, the aggregate system-wide effect on overall energy consumption is less clear; for instance, if users notice job performance degradation from GPU power caps, they may request additional GPU jobs to compensate, negating any energy savings or even worsening energy consumption. To our knowledge, our work is the first to conduct and make available a detailed analysis of the effects of GPU power-capping at the supercomputing scale. We hope our work will inspire HPCs/datacenters to further explore, evaluate, and communicate the impact of power-capping AI hardware accelerators for more sustainable AI.

Paper Structure

This paper contains 12 sections, 3 equations, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: Distribution of GPU temperatures (Celsius) from jobs with and without power capping. Shown here are the empirical distribution of average temperatures as well as distributions of temperatures of the 10th, 50th, and 90th percentiles across individual jobs with and without power caps. Note that the y-axis for frequency is on log scale.
  • Figure 2: Distribution of average standard deviation of GPU temperatures from jobs with and without power capping. Shown here is the distribution of the average standard deviation of GPU temperatures for individual jobs with and without power caps. We note that power-capped jobs show a smaller and more stable range of temperature fluctuations than uncapped jobs.
  • Figure 3: Distribution of GPU power draw (Watts) from jobs with and without power capping. Shown here are the distribution of average power draw as well as distributions of power draw at the 10th, 50th, and 90th percentiles across individual jobs with and without power caps.
  • Figure 4: Distribution of average standard deviation of GPU power draw from jobs with and without power capping. Shown here is the distribution of the average standard deviation of GPU power draw for individual jobs with and without power caps.
  • Figure 5: Optimally power-capping GPUs can decrease energy expenditure with minimal adverse impact on training speed. Stricter power caps (100W) can further reduce energy but disproportionately degrade training speed. Speed and energy values are normalized to training speed and energy without power capping (e.g., a value of 0.8 corresponds to a 20% decrease in speed/energy relative to no power caps).
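
The per-job statistics named in the captions above (average, 10th/50th/90th percentiles, standard deviation) and the normalization used in Figure 5 can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names `summarize_job` and `normalize_to_uncapped`, the sample readings, and the choice of the standard library's inclusive quantile method are all assumptions made for illustration.

```python
from statistics import mean, pstdev, quantiles


def summarize_job(samples):
    """Summarize one job's GPU readings (temperature in Celsius or power in Watts).

    Returns the mean, the 10th/50th/90th percentiles, and the population
    standard deviation. Requires at least two samples.
    """
    # quantiles(..., n=10) returns the nine cut points between deciles:
    # index 0 is the 10th percentile, index 4 the 50th, index 8 the 90th.
    deciles = quantiles(samples, n=10, method="inclusive")
    return {
        "mean": mean(samples),
        "p10": deciles[0],
        "p50": deciles[4],
        "p90": deciles[8],
        "std": pstdev(samples),
    }


def normalize_to_uncapped(capped_value, uncapped_value):
    """Express a capped measurement relative to the uncapped baseline.

    A result of 0.8 means a 20% decrease relative to no power cap.
    """
    if uncapped_value <= 0:
        raise ValueError("uncapped baseline must be positive")
    return capped_value / uncapped_value


# Hypothetical readings for a single job; these are not values from the paper.
job_temperatures_c = [60, 62, 64, 66, 68]
summary = summarize_job(job_temperatures_c)

# Hypothetical energy totals (arbitrary units) for the same workload.
relative_energy = normalize_to_uncapped(capped_value=80.0, uncapped_value=100.0)

print(summary)
print(f"relative energy: {relative_energy:.2f}")
```

Collecting one such summary per job and plotting a histogram of each field, separately for capped and uncapped jobs, would produce distributions of the kind the captions for Figures 1 through 4 describe.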