Employing polyhedral methods to optimize stencils on FPGAs with stencil-specific caches, data reuse, and wide data bursts

Florian Mayer; Julian Brandner; Michael Philippsen

TL;DR

This work presents a polyhedral-model-driven approach that accelerates stencil codes on FPGAs by generating stencil-specific cache structures and exploiting wide data bursts. It tiles the iteration space, selects a tiling permutation and a delta shift, and introduces cache-like buffers, which enables efficient hardware pipelining and data reuse. The key contributions are a taxonomy of cache buffers (full, chunk, line), a fusion strategy for temporal locality, and a five-step code-generation workflow. Compared with baseline HLS, the optimization improves the runtime of a set of 10 stencil benchmarks by between 43x and 156x. The evaluation also highlights the importance of tile sizing, data-shipment bandwidth, and padding for FPGA throughput.

Abstract

It is well known that, to accelerate stencil codes on CPUs or GPUs and to exploit hardware caches and their lines, optimizers must find spatial and temporal locality of array accesses to harvest data-reuse opportunities. FPGAs carry the burden that they have no built-in caches (or only pre-built hardware descriptions for cache blocks that are inefficient for stencil codes). This paper demonstrates that this lack is also an opportunity: polyhedral methods can be used to generate stencil-specific cache structures of the right sizes on the FPGA and to fill and flush them efficiently with wide bursts during stencil execution. The paper shows how to derive the appropriate directives and code restructurings from stencil codes so that the FPGA compiler generates fast stencil hardware. Switching on our optimization improves the runtime of a set of 10 stencils by between 43x and 156x.
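The idea of a stencil-specific buffer that is filled and flushed with contiguous transfers can be illustrated with a minimal C sketch. This is not the code the paper generates: the 2D 5-point stencil, the names `stencil_naive`, `stencil_tiled`, and `count_mismatches`, and the sizes `N` and `SZ` are all assumptions chosen for illustration, and no vendor HLS directives are shown. Each `memcpy` of a contiguous row stands in for one wide burst between off-chip memory and an on-chip buffer.

```c
#include <assert.h>
#include <string.h>

#define N  66   /* grid edge (hypothetical); the interior is (N-2) x (N-2) */
#define SZ 16   /* tile edge (hypothetical); must divide N-2 */

/* 5-point stencil kernel, shared by both variants so results are identical. */
static float stencil5(float c, float n, float s, float w, float e) {
    return 0.2f * (c + n + s + w + e);
}

/* Baseline: every access goes directly to the large "off-chip" arrays. */
void stencil_naive(float in[N][N], float out[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            out[i][j] = stencil5(in[i][j], in[i - 1][j], in[i + 1][j],
                                 in[i][j - 1], in[i][j + 1]);
}

/* Tiled variant: per tile, fill a small buffer (tile plus a one-cell halo),
 * compute only from the buffer, then flush the results row by row. */
void stencil_tiled(float in[N][N], float out[N][N]) {
    float buf[SZ + 2][SZ + 2];  /* input tile including halo */
    float res[SZ][SZ];          /* output tile */

    assert((N - 2) % SZ == 0);  /* sketch handles full tiles only */

    for (int ti = 1; ti < N - 1; ti += SZ) {
        for (int tj = 1; tj < N - 1; tj += SZ) {
            /* Fill: one contiguous row copy per buffer row (models a burst). */
            for (int r = 0; r < SZ + 2; r++)
                memcpy(buf[r], &in[ti - 1 + r][tj - 1], sizeof buf[r]);

            /* Compute: all five reads hit the local buffer. */
            for (int i = 0; i < SZ; i++)
                for (int j = 0; j < SZ; j++)
                    res[i][j] = stencil5(buf[i + 1][j + 1], buf[i][j + 1],
                                         buf[i + 2][j + 1], buf[i + 1][j],
                                         buf[i + 1][j + 2]);

            /* Flush: one contiguous row copy per result row. */
            for (int r = 0; r < SZ; r++)
                memcpy(&out[ti + r][tj], res[r], sizeof res[r]);
        }
    }
}

/* Runs both variants on the same input and counts differing cells. */
int count_mismatches(void) {
    static float in[N][N], a[N][N], b[N][N];
    int bad = 0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            in[i][j] = (float)((i * 31 + j * 17) % 97);

    stencil_naive(in, a);
    stencil_tiled(in, b);

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (a[i][j] != b[i][j])
                bad++;
    return bad;
}
```

In this sketch each interior input cell is read five times by the baseline but is copied into the buffer roughly once per tile, which is the data reuse the abstract refers to. The halo rows and columns are re-read by neighbouring tiles, so larger tiles reduce that overhead at the cost of more on-chip storage.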

Paper Structure (18 sections, 3 equations, 8 figures, 1 table)

This paper contains 18 sections, 3 equations, 8 figures, 1 table.

Introduction
Notation for Tiling Transformations
Cache buffers: Types and Fusion
Fusion of Caches for Temporal Locality
Optimization Method
Step 1 -- Picking the tile sizes SZ
Step 2 -- Picking a tiling permutation with the smallest cache buffers
Step 3 -- Picking a delta and an iteration padding
Step 4 -- Index redirection
Step 5 -- Code generation with declarations, pragmas, data shipment, halos, and bursts
Data shipment
Halos and bursts
Related Work
Evaluation
Impact of the tile sizes
...and 3 more sections

Figures (8)

Figure 1: Loop Tiling Example.
Figure 2: Running Example.
Figure 3: Normal form of tiling the code in Fig. \ref{fig:origStencil} with the transformation $T(SZ=(SZ_i, SZ_j, SZ_k), p=(i,k,j), \delta=(0,0,0,0,0,0))$. S is a shorthand for the original loop body.
Figure 4: Fused working sets and selected cache buffers (types and sizes) for two out of six feasible permutations for the running example with tile sizes $SZ=(32,32,32)$.
Figure 5: Generated code for the running example in Fig. \ref{fig:origStencil} for $T(SZ=(32,32,32), p=(i,j,k), \delta=(-1,-ti_1,\dots,-1,-ti_d))$ with shared buffers V'=[32,32,32], A'=[2,33,33] from Fig. \ref{fig:graphs}.
...and 3 more figures
