An Irredundant and Compressed Data Layout to Optimize Bandwidth Utilization of FPGA Accelerators
Corentin Ferry, Nicolas Derumigny, Steven Derrien, Sanjay Rajopadhye
TL;DR
The paper tackles memory bandwidth bottlenecks in FPGA accelerators by introducing an automatic HLS flow that jointly derives burst-friendly data layouts, data packing, and runtime compression. Central to the approach is the MARS layout, which partitions data into contiguous, irredundant blocks; their placement enables coalesced off-chip accesses and allows compression without disrupting execution. The workflow uses polyhedral analysis to identify contiguous data blocks, an ILP-based optimization to maximize contiguity across blocks, and a packing scheme with accompanying metadata that preserves alignment and supports decompression. Evaluation on FPGA hardware shows reductions in I/O cycles of up to $7\times$, with compression gains of up to $5.09\times$ for certain configurations. These gains come at a cost in area, and the trade-off depends on tile size and data type. The work thereby automates bandwidth-aware memory layout transformations and shows practical gains for bandwidth-bound stencil and Jacobi-like kernels in FPGA accelerators.
Abstract
Memory bandwidth is known to be a performance bottleneck for FPGA accelerators, especially when they deal with large multi-dimensional data-sets. A large body of work focuses on reducing off-chip transfers, but few authors try to improve the efficiency of those transfers. This paper addresses the latter issue by proposing (i) a compiler-based approach to the accelerator's data layout that maximizes contiguous access to off-chip memory, and (ii) data packing and runtime compression techniques that take advantage of this layout to further improve memory performance. We show that our approach can reduce I/O cycles by up to $7\times$ compared to un-optimized memory accesses.
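To make the role of contiguity concrete, the following minimal Python sketch counts how many separate bursts (maximal runs of consecutive addresses) are needed to read one 2D tile under two layouts. It is only an illustration of why layout affects transfer efficiency, not the layout derived in the paper: the function names (`count_runs`, `bursts_row_major`, `bursts_tile_contiguous`) and the simple "one burst per contiguous run" cost model are assumptions introduced here.

```python
def count_runs(addrs):
    """Count maximal runs of consecutive addresses in a sorted list."""
    runs = 0
    prev = None
    for a in addrs:
        if prev is None or a != prev + 1:
            runs += 1  # a gap in the address stream starts a new burst
        prev = a
    return runs


def bursts_row_major(n_cols, tile_rows, tile_cols, row0, col0):
    """Bursts needed to read a tile from a row-major array with n_cols columns."""
    addrs = sorted(
        (row0 + i) * n_cols + (col0 + j)
        for i in range(tile_rows)
        for j in range(tile_cols)
    )
    return count_runs(addrs)


def bursts_tile_contiguous(tile_rows, tile_cols, tile_index):
    """Bursts needed when each tile is stored as one contiguous block (hypothetical layout)."""
    size = tile_rows * tile_cols
    base = tile_index * size
    return count_runs(list(range(base, base + size)))


if __name__ == "__main__":
    # An 8x8 tile inside a 64-column row-major array: one burst per tile row.
    print(bursts_row_major(n_cols=64, tile_rows=8, tile_cols=8, row0=16, col0=8))
    # The same tile stored contiguously: a single burst.
    print(bursts_tile_contiguous(tile_rows=8, tile_cols=8, tile_index=3))
```

Under this simplified model, the row-major tile costs one burst per tile row, while the block-contiguous tile needs a single burst. Fewer, longer bursts are what a contiguity-maximizing layout aims for.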
