Extracting Practical, Actionable Energy Insights from Supercomputer Telemetry and Logs
Melanie Cornelius, Greg Cross, Shilpika Shilpika, Matthew T. Dearing, Zhiling Lan
TL;DR
This work addresses the challenge of understanding GPU power consumption in production HPC workloads by co-analyzing diverse system logs from the Polaris supercomputer. It introduces a preprocessing and post-processing workflow that unifies heterogeneous telemetry and scheduler data and reduces data volume by 94% while still yielding actionable insights. The study finds that GPU power is dominated by idle power, that memory utilization is generally low, and that memory allocation has limited direct impact on power. These findings motivate practical opportunities such as idle-power reduction and job-level power management, supported by RI-based metrics for variability. The authors provide an open-source analysis tool and dataset to enable reproducible, system-wide power optimization in heterogeneous HPC environments, and they outline avenues for extending the work to other systems and deeper domain analyses.
Abstract
As supercomputers grow in size and complexity, power efficiency has become a critical challenge, particularly in understanding GPU power consumption within modern HPC workloads. We address this challenge with a data co-analysis approach applied to system data collected from the Polaris supercomputer at Argonne National Laboratory. We focus on GPU utilization and power demands, navigating the complexities of large-scale, heterogeneous datasets. Our approach, which incorporates data preprocessing, post-processing, and statistical methods, condenses the data volume by 94% while preserving essential insights. Through this analysis, we uncover key opportunities for power optimization, such as reducing high idle power costs, applying power strategies at the job level, and aligning GPU power allocation with workload demands. Our findings provide actionable insights for energy-efficient computing and offer a practical, reproducible approach for applying existing research to optimize system performance.
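To make the idea of co-analysis concrete, the following is a minimal sketch of attributing GPU power samples to scheduler jobs and reducing them to per-job summaries. It is not the authors' tool: the record layouts, field names, sample values, and the `summarize_by_job` helper are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scheduler records: (job_id, node, start_s, end_s)
jobs = [
    ("job-a", "n0", 0, 30),
    ("job-b", "n0", 40, 60),
]

# Hypothetical GPU telemetry samples: (node, timestamp_s, power_watts)
samples = [
    ("n0", 0, 60.0), ("n0", 10, 250.0), ("n0", 20, 240.0),
    ("n0", 30, 55.0),                      # node idle between jobs
    ("n0", 40, 300.0), ("n0", 50, 310.0),
    ("n0", 60, 58.0),                      # node idle after job-b
]

def summarize_by_job(jobs, samples):
    """Attribute each sample to the job running on its node at that time
    (half-open interval [start, end)), then reduce each job's samples
    to a few summary statistics."""
    per_job = defaultdict(list)
    for node, ts, watts in samples:
        for job_id, job_node, start, end in jobs:
            if node == job_node and start <= ts < end:
                per_job[job_id].append(watts)
                break  # assume at most one job per node at a time
    return {
        job_id: {"n": len(w), "mean_w": mean(w), "max_w": max(w)}
        for job_id, w in per_job.items()
    }

summary = summarize_by_job(jobs, samples)
for job_id, stats in summary.items():
    print(job_id, stats)
```

Collapsing raw samples into one row per job is one simple way that joined telemetry can shrink while keeping job-level power behavior visible. Samples that fall outside every job interval are dropped here; an analysis concerned with idle power would keep and summarize them separately.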
