Optimising GPGPU Execution Through Runtime Micro-Architecture Parameter Analysis

Giuseppe M. Sarda, Nimish Shah, Debjyoti Bhattacharjee, Peter Debacker, Marian Verhelst

TL;DR

The paper addresses the limited understanding of GPGPU bottlenecks caused by proprietary tooling by using the open-source, RISC-V-based Vortex GPGPU to perform micro-architectural parameter analysis. It introduces a trace-driven, hardware-aware runtime mapping approach for OpenCL kernels and derives the runtime rule $lws = \frac{gws}{hp}$, with $hp = cores \times warps \times threads$, where $lws$ is the local work size, $gws$ the global work size, and $hp$ the hardware parallelism, so that mappings adapt without programmer input. Key contributions include a trace analysis framework for Vortex, a hardware-aware OpenCL runtime mapping method, and validation across diverse math kernels and ML layers showing significant performance gains and reduced execution variability. The work demonstrates a practical path to co-optimizing software and hardware on open GPUs, improving portability and efficiency for data-parallel workloads on open-source platforms.
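As a rough illustration of the rule above, the following minimal Python sketch computes $hp$ and the resulting $lws$ for a given configuration. The function names, the rounding up of non-integer quotients, and the clamp to a minimum of 1 when $gws < hp$ are assumptions made for this sketch; the summary does not state how the paper handles those cases.

```python
def hardware_parallelism(cores: int, warps: int, threads: int) -> int:
    """hp = cores * warps * threads, as defined in the TL;DR."""
    if min(cores, warps, threads) < 1:
        raise ValueError("cores, warps and threads must all be >= 1")
    return cores * warps * threads


def runtime_lws(gws: int, cores: int, warps: int, threads: int) -> int:
    """Local work size from the rule lws = gws / hp.

    Assumption (not from the source): non-integer quotients are rounded up,
    and the result is never smaller than 1.
    """
    if gws < 1:
        raise ValueError("gws must be >= 1")
    hp = hardware_parallelism(cores, warps, threads)
    # Integer ceiling division avoids floating-point rounding issues.
    return max(1, (gws + hp - 1) // hp)


if __name__ == "__main__":
    # Hypothetical configuration: 4 cores, 4 warps per core, 4 threads per warp.
    print(runtime_lws(gws=4096, cores=4, warps=4, threads=4))
```

With this choice, each hardware thread receives one work-group when $gws$ is a multiple of $hp$, which is the intent the rule expresses: the work is spread over all available hardware threads rather than fixed by the programmer.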

Abstract

GPGPU execution analysis has always been tied to closed-source, proprietary benchmarking tools that provide high-level, non-exhaustive, and/or statistical information, preventing a thorough understanding of bottlenecks and optimization possibilities. Open-source hardware platforms offer opportunities to overcome such limits and co-optimize the full {hardware-mapping-algorithm} compute stack. Yet, so far, this has remained under-explored. In this work, we exploit micro-architecture parameter analysis to develop a hardware-aware, runtime mapping technique for OpenCL kernels on the open Vortex RISC-V GPGPU. Our method is based on trace observations and targets optimal hardware resource utilization to achieve superior performance and flexibility compared to hardware-agnostic mapping approaches. The technique was validated on different architectural GPU configurations across several OpenCL kernels. Overall, our approach significantly enhances the performance of the open-source Vortex GPGPU, contributing to unlocking its potential and usability.

Paper Structure (4 sections, 1 equation, 2 figures)

This paper contains 4 sections, 1 equation, 2 figures.

Figures (2)

  • Figure 1: Execution traces of the vecadd kernel under 4 different lws. Each plot shows tagged instruction wavefronts, the PC, the active thread mask and the timestamp of instruction issues from different warps.
  • Figure 2: Violin plots of the latency ratio between our methodology and two baselines, fixed mapping (lws=32, right, in blue) and naive mapping (lws=1, left, in yellow), across 450 different HW architectural configurations. The data tables report the average, the worst result, and the share of results below 1 (x/450), expressed as a percentage.
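
The summary statistics named in the Figure 2 caption can be sketched as follows. This is only an illustration: the caption does not say which way the ratio is taken, so the sketch assumes ratio = baseline latency / latency of the proposed mapping, meaning values below 1 are cases where the proposed mapping is slower and the worst result is the minimum. The function name and the input values are hypothetical and are not data from the paper.

```python
def summarize_ratios(ratios):
    """Average, worst value, and percentage of ratios below 1.

    Assumption (not from the source): ratio = baseline / proposed latency,
    so the worst result is the smallest ratio.
    """
    if not ratios:
        raise ValueError("ratios must not be empty")
    n = len(ratios)
    below_one = sum(1 for r in ratios if r < 1.0)
    return {
        "average": sum(ratios) / n,
        "worst": min(ratios),
        "below_one_pct": 100.0 * below_one / n,
    }


if __name__ == "__main__":
    # Made-up ratios for four configurations, purely for illustration.
    print(summarize_ratios([2.0, 1.0, 0.5, 1.5]))
```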