Opening the Black Box: Performance Estimation during Code Generation for GPUs

Dominik Ernst, Georg Hager, Markus Holzer, Matthias Knorr, Gerhard Wellein

TL;DR

This paper tackles the problem of efficiently selecting code-generation configurations for GPU kernels by replacing black-box performance observations with a fast, memory-hierarchy-aware performance estimator. It extends the roofline model with cache-bandwidth limiters and pairs it with a data-volume estimate derived from the high-level address expressions that code generators produce. The combined approach enables quick ranking of configurations for memory-intensive GPU kernels, demonstrated on a long-range 3D stencil and a Lattice Boltzmann method application within the pystencils and lbmpy frameworks. The method is effective at identifying the general class of high-performing configurations, but modeling every performance-relevant mechanism remains challenging; future improvements include accounting for TLB misses and extending the approach to more hardware. The work provides a practical pathway to accelerate hardware-aware code generation and architectural exploration without executing generated code on target devices.

Abstract

Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. To cover the huge search space, code generation frameworks may apply time-intensive autotuning, exploit scenario-specific performance models, or treat performance as an intangible black box that must be described via machine learning. This paper addresses the selection problem by identifying the relevant performance-defining mechanisms through a performance model coupled with an analytic hardware metric estimator. This enables a quick exploration of large configuration spaces to identify highly efficient candidates with high accuracy. Our current approach targets memory-intensive GPGPU applications and focuses on the correct modeling of data transfer volumes to all levels of the memory hierarchy. We show how our method can be coupled to the pystencils stencil code generator, which is used to generate kernels for a range-four 3D25pt stencil and a complex two-phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best-performing candidate. The method is not limited to stencil kernels, but can be integrated into any code generator that can generate the required address expressions.
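
To make the idea of ranking configurations by estimated data volumes concrete, the following is a minimal sketch of a roofline-style estimate with one bandwidth limiter per memory level. It is not the paper's model: the "slowest limiter wins" rule, the names `MemoryLevel`, `estimate_runtime`, and `rank_configurations`, the configuration labels, and all bandwidth and volume numbers are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryLevel:
    name: str
    bandwidth_bytes_per_s: float  # assumed sustained bandwidth, not a measured value


def estimate_runtime(volumes_bytes, levels, flops, peak_flops_per_s):
    """Return (estimated seconds, limiter name) for one configuration.

    Assumption: every limiter overlaps perfectly, so the slowest one
    (compute or any memory level) alone determines the runtime.
    """
    times = {"compute": flops / peak_flops_per_s}
    for level in levels:
        times[level.name] = volumes_bytes[level.name] / level.bandwidth_bytes_per_s
    limiter = max(times, key=times.get)
    return times[limiter], limiter


def rank_configurations(configs, levels, flops, peak_flops_per_s):
    """Sort configuration names from fastest to slowest estimate."""
    return sorted(
        configs,
        key=lambda name: estimate_runtime(
            configs[name], levels, flops, peak_flops_per_s
        )[0],
    )


# Hypothetical hardware description (illustrative numbers only).
LEVELS = [
    MemoryLevel("L1", 10.0e12),
    MemoryLevel("L2", 2.5e12),
    MemoryLevel("DRAM", 1.0e12),
]
PEAK_FLOPS = 10.0e12
KERNEL_FLOPS = 1.0e9

# Hypothetical per-level data volumes in bytes for two thread block shapes.
CONFIGS = {
    "block_16x16": {"L1": 4.0e9, "L2": 2.0e9, "DRAM": 0.5e9},
    "block_32x4": {"L1": 4.0e9, "L2": 1.0e9, "DRAM": 0.6e9},
}

if __name__ == "__main__":
    for name in rank_configurations(CONFIGS, LEVELS, KERNEL_FLOPS, PEAK_FLOPS):
        seconds, limiter = estimate_runtime(
            CONFIGS[name], LEVELS, KERNEL_FLOPS, PEAK_FLOPS
        )
        print(f"{name}: {seconds * 1e3:.2f} ms, limited by {limiter}")
```

In this sketch the per-level volumes are plain inputs. In the approach described above they would come from an analytic estimator working on the generated address expressions, so no kernel has to run on the target device.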

Paper Structure

This paper contains 27 sections, 9 equations, 20 figures.

Figures (20)

  • Figure 1: Illustration of cache bank conflicts
  • Figure 2: Illustration of in- and outgoing data volumes
  • Figure 3: Illustration of the memory footprint computation. Example for a 2D 4pt stencil update and a $2\times2$ thread block.
  • Figure 4: Simplified representation of the code to compute the unique data footprint using a generic grid iteration visitor pattern. The example shows the L2 load block footprint computation.
  • Figure 5: Estimated cycles for one lattice update of a 32 wide warp
  • ...and 15 more figures