Data-Driven Analysis to Understand GPU Hardware Resource Usage of Optimizations

Tanzima Z. Islam, Aniruddha Marathe, Holland Schutte, Mohammad Zaeed

TL;DR

This work addresses the challenge of underutilized GPU hardware in heterogeneous HPC systems by proposing a data-driven, multi-objective framework that links common GPU optimizations to hardware resource usage. It introduces a composite score that combines execution time and SM utilization, and employs sparse coding and a Resource Significance Measure to identify which hardware resources most influence performance. The methodology is applied to Matmult and several ECP proxy applications (Pennant, MiniFE, MixBench), using NVML/CUPTI data collected with the dashing framework on NVIDIA Volta GPUs. It identifies optimization opportunities that improve execution time by up to $29.6\%$, SM utilization by up to $5.4\%$, and power consumption by up to $26.5\%$. These results provide a systematic, interpretable pathway from hardware resource usage to code transformations, enabling more effective auto-tuning and co-design of performance requirements for GPU-accelerated science.

Abstract

With heterogeneous systems, the number of GPUs per chip increases to provide computational capabilities for solving science at a nanoscopic scale. However, low utilization of single GPUs defies the need to invest more money in expensive accelerators. While related work develops optimizations for improving application performance, none studies how these optimizations impact hardware resource usage or the average GPU utilization. This paper takes a data-driven analysis approach to addressing this gap by (1) characterizing how hardware resource usage affects device utilization, execution time, or both, (2) presenting a multi-objective metric to identify important application-device interactions that can be optimized to improve device utilization and application performance jointly, (3) studying hardware resource usage behaviors of several optimizations for a benchmark application, and finally (4) identifying optimization opportunities for several scientific proxy applications based on their hardware resource usage behaviors. Furthermore, we demonstrate the applicability of our methodology by applying the identified optimizations to a proxy application, which improves the execution time, device utilization, and power consumption by up to 29.6%, 5.3%, and 26.5%, respectively.
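
To make the idea of a multi-objective metric in item (2) concrete, the following is a minimal sketch of one way to fold execution time and device utilization into a single number. It is not the paper's definition of its score: the function name `composite_scores`, the min-max normalization, and the `weight` parameter are all assumptions chosen for illustration, and the sample values are made up.

```python
def composite_scores(times, utils, weight=0.5):
    """Hypothetical composite score per code variant (higher is better).

    times  : execution times, one per variant (lower is better)
    utils  : SM utilization values, one per variant (higher is better)
    weight : assumed trade-off between the two objectives, in [0, 1]

    This is an illustrative stand-in, not the metric defined in the paper.
    """
    if len(times) != len(utils):
        raise ValueError("times and utils must have the same length")

    def min_max(values):
        # Rescale values to [0, 1]; a constant list maps to all zeros.
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0.0 for _ in values]
        return [(v - lo) / (hi - lo) for v in values]

    t_norm = min_max(times)
    u_norm = min_max(utils)
    # Invert the time term so that faster variants score higher.
    return [weight * (1.0 - t) + (1.0 - weight) * u
            for t, u in zip(t_norm, u_norm)]


# Made-up measurements for three variants of a kernel.
example_times = [10.0, 8.0, 6.0]    # seconds
example_utils = [50.0, 60.0, 70.0]  # percent SM utilization
example_scores = composite_scores(example_times, example_utils)
```

With a single score per variant, variants can be ranked jointly on both objectives; changing `weight` shifts the emphasis between run time and utilization.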

Paper Structure

This paper contains 25 sections, 4 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Abstract machine model of the NVIDIA Volta architecture.
  • Figure 2: Setting up the dashing framework.
  • Figure 3: Hardware resource usage behaviors that explain different $target$ metrics for various Matmult kernels.
  • Figure 4: Execution times of all seven Matmult versions.
  • Figure 5: Hardware resource usage behaviors that explain the $target \rightarrow \texttt{score}$ of two ECP applications and a benchmark. Based on their resource usage, we propose and implement several optimizations that improve the performance of the Main3 kernel in Pennant by up to 29.6% and SM utilization by up to 5.4%.
  • ...and 1 more figure