An Irredundant and Compressed Data Layout to Optimize Bandwidth Utilization of FPGA Accelerators

Corentin Ferry, Nicolas Derumigny, Steven Derrien, Sanjay Rajopadhye

TL;DR

The paper tackles memory bandwidth bottlenecks in FPGA accelerators by introducing an automatic HLS flow that jointly derives burst-friendly data layouts, data packing, and runtime compression. Central to the approach is the MARS layout, which partitions data into contiguous, irredundant blocks whose placement enables coalesced off-chip accesses and seamless compression without compromising execution. The workflow uses polyhedral analysis to identify contiguous data blocks, an ILP-based optimization to maximize contiguity across blocks, and a packing and metadata scheme that preserves alignment and supports decompression. Evaluation on FPGA hardware shows up to a $7\times$ reduction in I/O cycles, with compression gains of up to $5.09\times$ for certain configurations. These gains come at a cost in area and depend on tile size and data type. The work thereby automates bandwidth-aware memory layout transformations and shows practical gains for bandwidth-bound stencil and Jacobi-like kernels in FPGA accelerators.

Abstract

Memory bandwidth is known to be a performance bottleneck for FPGA accelerators, especially when they deal with large multi-dimensional data-sets. A large body of work focuses on reducing off-chip transfers, but few authors try to improve the efficiency of transfers. This paper addresses the latter issue by proposing (i) a compiler-based approach to the accelerator's data layout to maximize contiguous access to off-chip memory, and (ii) data packing and runtime compression techniques that take advantage of this layout to further improve memory performance. We show that our approach can decrease I/O cycles by up to $7\times$ compared to un-optimized memory accesses.
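
To make the data packing idea in the abstract concrete, here is a minimal sketch of generic bit packing. It is not the authors' implementation. The 18-bit element width, the 64-bit bus word, and the helper names `pack`, `unpack`, and `straddles` are all assumptions chosen for illustration. The sketch shows that back-to-back packing saves words compared with padding each element to 32 bits, but that some elements then cross a word boundary, which is the loss of address alignment that packing trades for bandwidth.

```python
def pack(values, width, word_bits=64):
    """Pack unsigned `width`-bit values back to back into `word_bits`-bit words."""
    mask = (1 << width) - 1
    stream = 0
    for k, v in enumerate(values):
        if v & ~mask:
            raise ValueError(f"value {v} does not fit in {width} bits")
        # Element k occupies bits [k*width, (k+1)*width) of the stream.
        stream |= v << (k * width)
    total_bits = len(values) * width
    n_words = -(-total_bits // word_bits)  # ceiling division
    word_mask = (1 << word_bits) - 1
    return [(stream >> (i * word_bits)) & word_mask for i in range(n_words)]


def unpack(words, width, count, word_bits=64):
    """Recover `count` values of `width` bits from the packed words."""
    stream = 0
    for i, w in enumerate(words):
        stream |= w << (i * word_bits)
    mask = (1 << width) - 1
    return [(stream >> (k * width)) & mask for k in range(count)]


def straddles(k, width, word_bits=64):
    """True if packed element k crosses a word boundary (is unaligned)."""
    first_word = (k * width) // word_bits
    last_word = (k * width + width - 1) // word_bits
    return first_word != last_word


# Hypothetical example: ten 18-bit values on a 64-bit bus.
values = list(range(10))
packed = pack(values, width=18)
padded_words = -(-len(values) * 32 // 64)  # same data padded to 32 bits each
print(len(packed), "packed words vs", padded_words, "padded words")
print("unaligned elements:", [k for k in range(10) if straddles(k, 18)])
```

Reading one element from the packed stream may require two bus words and a shift, whereas the padded layout allows direct addressing. A layout that keeps data contiguous makes this cost easier to absorb, since whole packed blocks can be streamed in bursts and unpacked on chip.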

Paper Structure (48 sections, 1 equation, 11 figures, 2 tables, 1 algorithm)

Figures (11)

  • Figure 1: Domain of the Jacobi stencil divided into tiles of size $6 \times 6$. Each tile contains 18 $(t, i)$ points corresponding to 18 computations of $c_{t, i}$s.
  • Figure 2: Compiler flow (our contributions in green)
  • Figure 3: Inter-tile communication pattern for the Jacobi stencil: red arrows indicate data input into the tile shown in the center, and blue arrows indicate data output from this tile.
  • Figure 4: Macro-pipeline structure: read-execute-write. Our contribution focuses on the read and write stages.
  • Figure 5: Data packing and compression reduce storage and transfer redundancy at the expense of address alignment and, for compression, predictability of addresses.
  • ...and 6 more figures