Extracting Practical, Actionable Energy Insights from Supercomputer Telemetry and Logs

Melanie Cornelius, Greg Cross, Shilpika Shilpika, Matthew T. Dearing, Zhiling Lan

TL;DR

This work tackles the challenge of understanding GPU power consumption in production HPC workloads by co-analyzing diverse system logs from the Polaris supercomputer. It introduces a data preprocessing and post-processing workflow that unifies heterogeneous telemetry and scheduler data, yielding actionable insights while reducing data volume by 94%. The study finds that GPU power is dominated by idle power, that memory utilization is generally low, and that memory allocation has limited direct impact on power. These findings motivate practical opportunities such as idle-power reduction and job-level power management, with resource imbalance (RI) metrics used to characterize variability. The authors provide an open-source analysis tool and dataset to enable reproducible, system-wide power optimization in heterogeneous HPC environments, and outline avenues for extending the work to other systems and deeper domain analyses.

Abstract

As supercomputers grow in size and complexity, power efficiency has become a critical challenge, particularly in understanding GPU power consumption within modern HPC workloads. This work addresses this challenge by presenting a data co-analysis approach using system data collected from the Polaris supercomputer at Argonne National Laboratory. We focus on GPU utilization and power demands, navigating the complexities of large-scale, heterogeneous datasets. Our approach, which incorporates data preprocessing, post-processing, and statistical methods, condenses the data volume by 94% while preserving essential insights. Through this analysis, we uncover key opportunities for power optimization, such as reducing high idle power costs, applying power strategies at the job level, and aligning GPU power allocation with workload demands. Our findings provide actionable insights for energy-efficient computing and offer a practical, reproducible approach for applying existing research to optimize system performance.

Paper Structure

This paper contains 25 sections, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Telemetry and job data collection flow on the Polaris supercomputer. Compute node GPU metrics are sampled by NVIDIA and HPCM services, and job event data are recorded on the scheduler node. These data are streamed via a Kafka bus for later en masse processing.
  • Figure 2: Our data co-analysis process: (i) Preprocessing stage (gray) and (ii) Post-processing stage (blue).
  • Figure 3: Resource imbalance categories showing resource demands on two nodes (blue and red). (Top) $RI_{temporal}$ captures intra-node variation: constant workloads exhibit no change, while stochastic workloads show significant fluctuations. (Bottom) $RI_{spatial}$ captures inter-node differences: constant workloads have uniform resource usage, while stochastic workloads display high node-to-node variance.
  • Figure 4: (Top) Resource decomposition by job class. (Bottom) Mean power draw by job class (see Table \ref{queue-table}) and number of GPUs used, showing the median, quartiles (Q1 and Q3) as box edges, minimum and maximum values as whisker endpoints, and outliers along an outer line.
  • Figure 5: Pearson’s coefficients for job metrics. Node hours, runtime, and energy show strong correlations (A), while GPU count, power, and energy correlate weakly (B).
  • ...and 9 more figures
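
The distinction drawn in the Figure 3 caption between intra-node (temporal) and inter-node (spatial) variation can be made concrete with a small sketch. The captions here do not give the paper's formulas for $RI_{temporal}$ and $RI_{spatial}$, so the code below substitutes a plain population standard deviation as a stand-in. The function names and the sample utilization values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def temporal_imbalance(node_series):
    """Average, over nodes, of each node's spread across time.

    Assumption: population standard deviation stands in for the paper's
    RI_temporal, whose exact definition is not given in this summary.
    """
    return mean(pstdev(series) for series in node_series)


def spatial_imbalance(node_series):
    """Average, over time steps, of the spread across nodes.

    Assumption: population standard deviation stands in for the paper's
    RI_spatial. All nodes must have the same number of samples.
    """
    # zip(*...) regroups the per-node series into per-time-step tuples.
    return mean(pstdev(step) for step in zip(*node_series))


# Hypothetical utilization samples (percent) for two nodes, four time steps.
constant = [[50, 50, 50, 50], [50, 50, 50, 50]]
stochastic = [[10, 90, 20, 80], [60, 95, 70, 99]]

for name, job in (("constant", constant), ("stochastic", stochastic)):
    print(name, temporal_imbalance(job), spatial_imbalance(job))
```

Under this stand-in, the constant workload scores zero on both measures, while the stochastic workload scores above zero on both, matching the qualitative behavior the caption describes.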