Minos: Systematically Classifying Performance and Power Characteristics of GPU Workloads on HPC Clusters

Rutwik Jain, Yiwei Jiang, Matthew D. Sinclair, Shivaraman Venkataraman

Abstract

As large-scale HPC clusters adopt accelerators such as GPUs to meet the voracious demands of modern workloads, they are increasingly becoming power constrained. Unfortunately, modern applications can often temporarily exceed the power ratings of the accelerators ("power spikes"). Thus, current and future HPC systems must optimize for power and performance together. However, this is made difficult by increasingly diverse applications, which often require bespoke optimizations to run efficiently on each cluster. Traditionally, researchers overcome this problem by profiling applications on specific clusters and optimizing them, but the scale, algorithmic diversity, and lack of effective tools make this challenging. To overcome these inefficiencies, we propose Minos, a systematic classification mechanism that identifies similar application characteristics via low-cost profiling for power and performance. This allows us to group similarly behaving workloads into a finite number of distinct classes and reduce the overhead of extensively profiling new workloads. For example, when predicting frequency capping behavior for a previously unseen application, Minos reduces profiling time by 89%. Moreover, across 18 popular graph analytics, HPC, HPC+ML, and ML workloads, Minos achieves a mean error of 4% for power predictions and 3% for performance predictions, significantly improving predictions over state-of-the-art approaches by 10%.

Paper Structure

This paper contains 36 sections, 3 equations, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 1: Time series plots showing power behavior for LLaMA3-8B inference and LSMS over two iterations.
  • Figure 2: Cumulative power spike distribution for LLaMA3-8B inference (left) and histogram showing fraction of spikes if binning is performed with bin size = 0.1 (right), along with the resultant power spike vector v.
  • Figure 3: Dendrogram based on power spike distributions of workloads. We label the clusters as Low-spike (orange), High-spike (green), and Mixed (red), respectively, based on their power distribution.
  • Figure 4: K-Means Clustering on memory and compute utilization showing workloads grouped as C (compute-intensive), M (memory-intensive) and H (hybrid).
  • Figure 5: Cumulative power distributions showing power spikes for three categories of workloads.
  • ...and 7 more figures
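
The binning step mentioned in the Figure 2 caption (bin size = 0.1, producing a power spike vector v) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that power samples are normalized by the accelerator's rated power, that a "spike" is any sample above that rating, and that spikes are binned over the range 1.0 to 2.0 times the rating. The function name `power_spike_vector` and the parameters `rated_power_w` and `max_ratio` are hypothetical.

```python
def power_spike_vector(samples_w, rated_power_w, bin_size=0.1, max_ratio=2.0):
    """Return the fraction of spike samples falling in each normalized-power bin.

    Assumptions (not taken from the paper): a spike is a sample above the
    rated power, and bins cover ratios in (1.0, max_ratio].
    """
    # Normalize every sample by the rated power of the accelerator.
    ratios = [p / rated_power_w for p in samples_w]
    # Keep only samples that exceed the rating.
    spikes = [r for r in ratios if r > 1.0]

    n_bins = round((max_ratio - 1.0) / bin_size)
    counts = [0] * n_bins
    for r in spikes:
        # Clamp into the last bin so very large spikes are still counted.
        idx = min(int((r - 1.0) / bin_size), n_bins - 1)
        counts[idx] += 1

    total = len(spikes)
    # Fractions sum to 1 when there is at least one spike; all zeros otherwise.
    return [c / total if total else 0.0 for c in counts]


if __name__ == "__main__":
    # Hypothetical samples in watts against a hypothetical 100 W rating.
    v = power_spike_vector([90, 105, 115, 125, 108], rated_power_w=100)
    print(v)
```

Vectors of this kind, one per workload, could then be compared with a distance metric and fed to a hierarchical clustering routine to obtain a dendrogram like the one described for Figure 3. Samples that land exactly on a bin edge are sensitive to floating-point rounding in this sketch.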