Data-Driven Analysis to Understand GPU Hardware Resource Usage of Optimizations
Tanzima Z. Islam, Aniruddha Marathe, Holland Schutte, Mohammad Zaeed
TL;DR
This work addresses the challenge of underutilized GPU hardware in heterogeneous HPC systems by proposing a data-driven, multi-objective framework that links common GPU optimizations to hardware resource usage. It introduces a composite score that combines execution time and SM utilization, and employs sparse coding and a Resource Significance Measure to identify which hardware resources most influence performance. The methodology is applied to Matmult and several ECP proxy applications (Pennant, MiniFE, MixBench), using NVML/CUPTI data collected with the dashing framework on NVIDIA Volta GPUs, and demonstrates optimization opportunities that improve execution time by up to $29.6\%$, SM utilization by up to $5.3\%$, and power consumption by up to $26.5\%$. These results provide a systematic, interpretable pathway from hardware-resource usage to code transformations, enabling more effective auto-tuning and co-design of performance requirements for GPU-accelerated science.
Abstract
With heterogeneous systems, the number of GPUs per chip increases to provide computational capabilities for solving scientific problems at a nanoscopic scale. However, low utilization of individual GPUs undermines the case for investing in more of these expensive accelerators. While related work develops optimizations for improving application performance, none studies how these optimizations impact hardware resource usage or the average GPU utilization. This paper takes a data-driven analysis approach to address this gap by (1) characterizing how hardware resource usage affects device utilization, execution time, or both, (2) presenting a multi-objective metric to identify important application-device interactions that can be optimized to improve device utilization and application performance jointly, (3) studying the hardware resource usage behaviors of several optimizations for a benchmark application, and finally (4) identifying optimization opportunities for several scientific proxy applications based on their hardware resource usage behaviors. Furthermore, we demonstrate the applicability of our methodology by applying the identified optimizations to a proxy application, which improves the execution time, device utilization, and power consumption by up to 29.6%, 5.3%, and 26.5%, respectively.
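To make the idea of a multi-objective metric concrete, the following is a minimal sketch of one way execution time and SM utilization could be folded into a single score per configuration. The function name `composite_score`, the min-max normalization of execution time, and the `weight` parameter are assumptions chosen for illustration; the sketch does not reproduce the metric defined in this paper.

```python
def composite_score(exec_times, sm_utils, weight=0.5):
    """Hypothetical joint score per configuration (higher is better).

    exec_times: execution time of each configuration, in seconds.
    sm_utils:   average SM utilization of each configuration, in percent.
    weight:     assumed trade-off between the two objectives (0 to 1).
    """
    t_min, t_max = min(exec_times), max(exec_times)
    span = (t_max - t_min) or 1.0  # avoid division by zero if all times match

    scores = []
    for t, u in zip(exec_times, sm_utils):
        time_score = (t_max - t) / span  # 1.0 for the fastest configuration
        util_score = u / 100.0           # utilization rescaled to [0, 1]
        scores.append(weight * time_score + (1.0 - weight) * util_score)
    return scores


# Illustrative (made-up) measurements for three configurations.
scores = composite_score([10.0, 8.0, 12.0], [40.0, 60.0, 30.0])
best = max(range(len(scores)), key=scores.__getitem__)
```

With equal weighting, a configuration ranks highly only if it is both fast and keeps the SMs busy, which is the kind of joint behavior a single-objective metric would miss.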
