An Irredundant and Compressed Data Layout to Optimize Bandwidth Utilization of FPGA Accelerators

Corentin Ferry; Nicolas Derumigny; Steven Derrien; Sanjay Rajopadhye

TL;DR

The paper tackles memory bandwidth bottlenecks in FPGA accelerators by introducing an automatic HLS flow that jointly derives burst-friendly data layouts, data packing, and runtime compression. Central to the approach is the MARS layout, which partitions data into contiguous, irredundant blocks whose placement enables coalesced off-chip accesses and accommodates compression. The workflow uses polyhedral analysis to identify contiguous data blocks, an ILP-based optimization to maximize contiguity across blocks, and a packing scheme with metadata that keeps packed and compressed data addressable and decompressible. Evaluation on FPGA hardware shows reductions in I/O cycles of up to $7\times$, with compression gains of up to $5.09\times$ for certain configurations; these gains come at a cost in area and depend on tile size and data type. The work thereby automates bandwidth-aware memory layout transformations and shows practical gains for bandwidth-bound stencil and Jacobi-like kernels in FPGA accelerators.

Abstract

Memory bandwidth is known to be a performance bottleneck for FPGA accelerators, especially when they deal with large multi-dimensional data-sets. A large body of work focuses on reducing off-chip transfers, but few authors try to improve the efficiency of transfers. This paper addresses the latter issue by proposing (i) a compiler-based approach to the accelerator's data layout that maximizes contiguous access to off-chip memory, and (ii) data packing and runtime compression techniques that take advantage of this layout to further improve memory performance. We show that our approach can decrease I/O cycles by up to $7\times$ compared to un-optimized memory accesses.
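The trade-off between padding and packing that the abstract alludes to can be illustrated with a small sketch. It is not the paper's implementation: the 18-bit element width, the 512-bit bus, and the helper names `padded_bits`, `packed_bits`, `pack`, and `unpack` are all assumptions chosen for illustration.

```python
def padded_bits(n_values: int, width: int) -> int:
    """Bits moved when each value is padded to the next power-of-two slot (at least 8 bits)."""
    slot = 8
    while slot < width:
        slot *= 2
    return n_values * slot


def packed_bits(n_values: int, width: int, bus_bits: int = 512) -> int:
    """Bits moved when values are stored back to back, rounded up to whole bus words.

    The 512-bit bus width is an assumed figure, not one taken from the paper.
    """
    total = n_values * width
    words = -(-total // bus_bits)  # ceiling division
    return words * bus_bits


def pack(values, width):
    """Concatenate unsigned width-bit values into one integer, first value in the low bits."""
    word = 0
    for k, v in enumerate(values):
        assert 0 <= v < (1 << width), "value does not fit in the given width"
        word |= v << (k * width)
    return word


def unpack(word, width, count):
    """Recover `count` unsigned width-bit values from a packed integer."""
    mask = (1 << width) - 1
    return [(word >> (k * width)) & mask for k in range(count)]


if __name__ == "__main__":
    n, w = 1024, 18  # hypothetical block of 1024 values, 18 bits each
    print("padded:", padded_bits(n, w), "bits")
    print("packed:", packed_bits(n, w), "bits")
    print("round trip ok:", unpack(pack([1, 2, 3], w), w, 3) == [1, 2, 3])
```

Under these assumptions, packing moves fewer bits than padding, but elements no longer start at aligned addresses, which is the cost that the Figure 5 caption attributes to packing.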

Paper Structure (48 sections, 1 equation, 11 figures, 2 tables, 1 algorithm)

Introduction
Background
Locality optimizations
Illustrative example: 1D Jacobi stencil
Deriving parallel accelerators using HLS
Padding vs packing
Runtime data compression
Memory Layout Optimization
Extracting Contiguous Data Blocks
Enabling Coalesced Accesses across Contiguous Data Blocks
Properties of the layout
Contiguous tile-level allocation
Irredundancy of storage
Example
General case
...and 33 more sections

Figures (11)

Figure 1: Domain of the Jacobi stencil divided into tiles of size $6 \times 6$. Each tile contains 18 $(t, i)$ points corresponding to 18 computations of $c_{t, i}$s.
Figure 2: Compiler flow (our contributions in green)
Figure 3: Inter-tile communication pattern for the Jacobi stencil: red arrows indicate data input into the tile shown in the center, and blue arrows indicate data output from this tile.
Figure 4: Macro-pipeline structure: read-execute-write. Our contribution focuses on the read and write stages.
Figure 5: Data packing and compression reduce storage and transfer redundancy at the expense of address alignment and, for compression, predictability of addresses.
...and 6 more figures
