Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs

Milo Lurati, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven

TL;DR

This work introduces the first HIP auto-tuner by extending Kernel Tuner with PyHIP to run HIP kernels on both AMD and Nvidia GPUs. Using four tunable kernels on two AMD and two Nvidia devices, it shows that auto-tuning yields substantially larger performance gains on AMD (around 10x) than on Nvidia (around 2x), and that Nvidia-optimized configurations often perform poorly on AMD. The study also finds limited cross-vendor portability: AMD-tuned configurations tend to generalize better to Nvidia than the reverse. Overall, the results argue for vendor-specific auto-tuning when porting HIP codes, and the work provides a production-ready HIP backend in Kernel Tuner to support broader, cross-vendor HPC workloads.

Abstract

Many studies have focused on developing and improving auto-tuning algorithms for Nvidia Graphics Processing Units (GPUs), but the effectiveness and efficiency of these approaches on AMD devices have hardly been studied. This paper aims to address this gap by introducing an auto-tuner for AMD's HIP. We do so by extending Kernel Tuner, an open-source Python library for auto-tuning GPU programs. We analyze the performance impact and tuning difficulty for four highly tunable benchmark kernels on four different GPUs: two from Nvidia and two from AMD. Our results demonstrate that auto-tuning has a significantly higher impact on performance on AMD than on Nvidia (10x vs 2x). Additionally, we show that applications tuned for Nvidia do not perform optimally on AMD, underscoring the importance of auto-tuning specifically for AMD to achieve high performance on these GPUs.
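To make the idea of "tuning impact" concrete, the following is a minimal brute-force auto-tuning sketch in plain Python. It is not Kernel Tuner's API and not the paper's method: the parameter names, the `fake_runtime_ms` cost model, and the choice to measure impact as the ratio of median to best runtime are all assumptions made for illustration, since the paper's own metric definitions do not appear in this section. A real tuner would compile and benchmark the kernel on a GPU in place of the synthetic cost function.

```python
import itertools
import statistics


def search_space(tune_params):
    """Yield every combination of tunable parameter values as a dict."""
    names = list(tune_params)
    for values in itertools.product(*(tune_params[n] for n in names)):
        yield dict(zip(names, values))


def fake_runtime_ms(config):
    """Hypothetical stand-in for benchmarking one kernel configuration."""
    threads = config["block_size_x"] * config["block_size_y"]
    # Synthetic cost model (an assumption): penalise thread blocks that are
    # far from 256 threads, and make larger tile sizes cheaper.
    return 1.0 + abs(threads - 256) / 64 + 4.0 / config["tile_size"]


def brute_force_tune(tune_params, measure):
    """Evaluate every configuration and report the best one.

    Returns (best_config, best_time, impact), where impact is assumed here
    to be the median runtime divided by the best runtime.
    """
    results = [(measure(c), c) for c in search_space(tune_params)]
    best_time, best_config = min(results, key=lambda r: r[0])
    median_time = statistics.median(t for t, _ in results)
    return best_config, best_time, median_time / best_time


# Hypothetical tunable parameters; real kernels define their own.
tune_params = {
    "block_size_x": [16, 32, 64, 128],
    "block_size_y": [1, 2, 4, 8],
    "tile_size": [1, 2, 4],
}

best_config, best_time, impact = brute_force_tune(tune_params, fake_runtime_ms)
print(best_config, best_time, impact)
```

Under this assumed definition, an impact of 10 would mean that the best configuration runs ten times faster than the median one, so a larger value indicates that picking a configuration without tuning is more costly.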

Paper Structure (13 sections, 3 equations, 13 figures, 8 tables)


Figures (13)

  • Figure 1: Kernel Tuner software architecture.
  • Figure 2: Fitness Flow Graph of 2D Convolution search space for A4000.
  • Figure 3: 2D Convolution tuning search space.
  • Figure 4: 2D Convolution proportion of centrality.
  • Figure 5: Hotspot tuning search space.
  • ...and 8 more figures