Employing polyhedral methods to optimize stencils on FPGAs with stencil-specific caches, data reuse, and wide data bursts
Florian Mayer, Julian Brandner, Michael Philippsen
TL;DR
This work presents a polyhedral-model-driven approach to accelerating stencil codes on FPGAs by generating stencil-specific cache structures and exploiting wide data bursts. By tiling the iteration space, selecting permutation and delta shifts, and inlining cache-like buffers, the method enables efficient hardware pipelining and data reuse. The key contributions are a taxonomy of cache buffers (full, chunk, line), a fusion strategy for temporal locality, and a five-step code-generation workflow. With the optimization switched on, the runtime of a set of 10 stencil benchmarks improves by between 43x and 156x over baseline HLS. The results highlight the importance of tile sizing, data-shipment bandwidth, and padding for maximizing FPGA throughput.
Abstract
It is well known that, to accelerate stencil codes on CPUs or GPUs and to exploit hardware caches and their cache lines, optimizers must find spatial and temporal locality of array accesses in order to harvest data-reuse opportunities. FPGAs carry the burden that they have no built-in caches (or only pre-built hardware descriptions for cache blocks that are inefficient for stencil codes). This paper demonstrates that this lack is also an opportunity: polyhedral methods can be used to generate stencil-specific cache structures of the right sizes on the FPGA and to fill and flush them efficiently with wide bursts during stencil execution. The paper shows how to derive the appropriate directives and code restructurings from stencil codes so that the FPGA compiler generates fast stencil hardware. Switching on our optimization improves the runtime of a set of 10 stencils by between 43x and 156x.
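
To make the idea of a stencil-specific buffer that is filled and flushed in bursts more concrete, here is a minimal sketch in plain C. It is not the code the paper's generator emits. It assumes a simple 1D 3-point averaging stencil, and the names `stencil_naive`, `stencil_tiled`, and the tile width `TILE` are hypothetical, chosen only for illustration. Each tile is copied into a small local buffer with one contiguous read (tile plus a one-element halo on each side), computed entirely from that buffer, and written back with one contiguous write. An HLS tool may map such contiguous copies to burst transfers, but that depends on the tool and its directives.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 64 /* hypothetical tile width, chosen only for illustration */

/* Reference version: every stencil point reads its three inputs
 * directly from the (conceptually off-chip) input array. */
void stencil_naive(const float *in, float *out, size_t n) {
    assert(n >= 2);
    out[0] = in[0];         /* boundary points are copied unchanged */
    out[n - 1] = in[n - 1];
    for (size_t i = 1; i + 1 < n; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

/* Tiled version: a local buffer sized for one tile plus its halo plays
 * the role of a stencil-specific cache. */
void stencil_tiled(const float *in, float *out, size_t n) {
    assert(n >= 2);
    float buf[TILE + 2]; /* tile inputs plus left and right halo element */
    float res[TILE];     /* tile results, flushed in one contiguous write */

    out[0] = in[0];
    out[n - 1] = in[n - 1];

    for (size_t start = 1; start + 1 < n; start += TILE) {
        size_t len = n - 1 - start; /* interior points still to compute */
        if (len > TILE)
            len = TILE;

        /* Fill: one contiguous read of in[start-1 .. start+len]. */
        memcpy(buf, in + start - 1, (len + 2) * sizeof(float));

        /* Compute: all three reads per point hit the local buffer,
         * so each loaded element is reused up to three times. */
        for (size_t j = 0; j < len; j++)
            res[j] = (buf[j] + buf[j + 1] + buf[j + 2]) / 3.0f;

        /* Flush: one contiguous write of the tile's results. */
        memcpy(out + start, res, len * sizeof(float));
    }
}
```

Calling both functions on the same input yields identical output arrays. The difference lies only in the memory traffic: the tiled version touches the input array once per tile, plus two halo elements, instead of three times per stencil point.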
