Generic and ML Workloads in an HPC Datacenter: Node Energy, Job Failures, and Node-Job Analysis

Xiaoyu Chu, Daniel Hofstätter, Shashikant Ilager, Sacheendra Talluri, Duncan Kampert, Damian Podareanu, Dmitry Duplyakin, Ivona Brandic, Alexandru Iosup

TL;DR

The paper addresses how ML workloads affect HPC datacenter operation compared with generic workloads by leveraging long-term, open traces from a national-scale facility. It proposes a data-driven framework to jointly analyze node and job data across energy, utilization, and failure dimensions, supported by three linked datasets and open-source analysis tools. Key findings reveal that ML workloads disproportionately consume energy, cause frequent GPU thermal stress, exhibit longer runtimes, and display diurnal failure patterns, with a substantial share of energy wasted on unsuccessful terminations. These insights offer actionable guidance for energy-aware scheduling, checkpointing, and topology-aware resource allocation, and establish a foundation for reproducible, data-driven HPC workload research.

Abstract

HPC datacenters form a backbone of modern digital society. Increasingly, they run Machine Learning (ML) jobs next to generic, compute-intensive workloads, supporting science, business, and other decision-making processes. However, understanding how ML jobs impact the operation of HPC datacenters, relative to generic jobs, remains desirable but understudied. In this work, we leverage long-term operational data, collected from a national-scale production HPC datacenter, and statistically compare how ML and generic jobs can impact the performance, failures, resource utilization, and energy consumption of HPC datacenters. Our study provides key insights, e.g., ML-related power usage causes GPU nodes to run into temperature limitations, median/mean runtime and failure rates are higher for ML jobs than for generic jobs, both ML and generic jobs exhibit highly variable arrival processes and resource demands, significant amounts of energy are spent on unsuccessfully terminating jobs, and concurrent jobs tend to terminate in the same state. We open-source our cleaned-up data traces on Zenodo (https://doi.org/10.5281/zenodo.13685426), and provide our analysis toolkit as software hosted on GitHub (https://github.com/atlarge-research/2024-icpads-hpc-workload-characterization). This study offers multiple benefits for datacenter administrators, who can improve operational efficiency, and for researchers, who can further improve system designs and scheduling techniques.

Paper Structure (19 sections, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Generic vs. ML hardware and workload, summary. Energy demands of ML jobs are proportionally higher than their share of submissions and runtime.
  • Figure 2: An example of the data integration process. We match each job record to the fine-grained, 30 s interval timestamps of the node dataset.
  • Figure 3: Normalized node utilization across various metrics is depicted using probability density functions (left) and box plots (right), revealing high GPU temperatures.
  • Figure 4: Average GPU temperature at various power utilizations across GPU indices (0 to 3) in the node. For the same power usage, GPU temperatures vary greatly.
  • Figure 5: The total number of submitted generic jobs and ML jobs, showing high variability over time.
  • ...and 5 more figures
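
The Figure 2 caption describes matching each job record to the 30 s timestamps of the node dataset. A minimal sketch of such a join is given here, under stated assumptions: the `Job` fields, the `(node, timestamp) -> watts` layout of `node_samples`, and the helper names are all hypothetical and are not taken from the paper's datasets or toolkit. The energy estimate simply multiplies each sampled power value by the sampling interval, which is one plausible approximation rather than the authors' method.

```python
from __future__ import annotations

from dataclasses import dataclass

SAMPLE_INTERVAL_S = 30  # sampling period of the node data, per the Figure 2 caption


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; the field names are assumptions."""
    job_id: str
    node: str
    start: int  # Unix seconds, inclusive
    end: int    # Unix seconds, exclusive


def sample_timestamps(job: Job, interval: int = SAMPLE_INTERVAL_S) -> list[int]:
    """Return the interval-aligned timestamps that fall inside [start, end)."""
    first = -(-job.start // interval) * interval  # round start up to the next tick
    return list(range(first, job.end, interval))


def join_job_to_node_samples(
    job: Job,
    node_samples: dict[tuple[str, int], float],
    interval: int = SAMPLE_INTERVAL_S,
) -> list[tuple[str, str, int, float]]:
    """Attach node power samples (watts) to a job; missing samples are skipped."""
    rows = []
    for ts in sample_timestamps(job, interval):
        power = node_samples.get((job.node, ts))
        if power is not None:
            rows.append((job.job_id, job.node, ts, power))
    return rows


def estimate_energy_joules(
    rows: list[tuple[str, str, int, float]],
    interval: int = SAMPLE_INTERVAL_S,
) -> float:
    """Approximate energy as sampled power times the sampling interval."""
    return sum(row[3] * interval for row in rows)


# Toy data, invented for illustration only.
example_job = Job(job_id="j1", node="n1", start=45, end=130)
example_samples = {
    ("n1", 60): 100.0,
    ("n1", 90): 200.0,
    ("n1", 120): 150.0,
    ("n2", 60): 999.0,  # different node, must not be matched
}
example_rows = join_job_to_node_samples(example_job, example_samples)
example_energy = estimate_energy_joules(example_rows)
```

Grouping the joined rows by a job's final state would then give a per-state energy total, which is the kind of node-job quantity the abstract refers to when it mentions energy spent on unsuccessfully terminating jobs.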