IMAGine: An In-Memory Accelerated GEMV Engine Overlay

MD Arafat Kabir, Tendayi Kamucheka, Nathaniel Fredricks, Joel Mandebi, Jason Bakos, Miaoqing Huang, David Andrews

TL;DR

IMAGine tackles the memory bottleneck in BRAM-dense FPGA workloads with an in-memory GEMV engine overlay that clocks at BRAM Fmax and scales linearly with BRAM capacity. The architecture uses a 2D GEMV tile array with PiCaSO-IM PIM modules to sustain BRAM-speed operation, reaching 64K PEs on an AMD Alveo U55 at 737 MHz and outperforming prior PIM GEMV designs and competing ASICs in clock rate while maintaining high MAC density. Comparative analyses show that IMAGine's system clock is $2.65\times$–$3.2\times$ faster than that of existing FPGA PIM GEMV engines, and its linear BRAM scalability enables full BRAM utilization without DSPs. The work challenges the notion that FPGA overlays cannot reach BRAM Fmax, delivers open-source implementations, and outlines future MLIR-based compiler support for hardware/software co-design.

Abstract

Processor-in-Memory (PIM) overlays and newly redesigned reconfigurable tile fabrics have been proposed to eliminate the von Neumann bottleneck and enable processing performance to scale with BRAM capacity. The performance of these FPGA-based PIM architectures has been limited by a reduction in the BRAMs' maximum clock frequency and by less-than-ideal scaling of processing elements with increased BRAM capacity. This paper presents IMAGine, an In-Memory Accelerated GEMV engine: a PIM-array accelerator that clocks at the maximum frequency of the BRAM and scales to 100% of the available BRAMs. Comparative analyses of execution speed against existing PIM-based GEMV engines on FPGAs are presented, showing a 2.65x to 3.2x faster clock. An AMD Alveo U55 implementation achieves a system clock speed of 737 MHz and provides 64K bit-serial multiply-accumulate (MAC) units for GEMV operation. This establishes IMAGine as the fastest PIM-based GEMV overlay, outperforming even the custom PIM-based FPGA accelerators reported to date. Additionally, it surpasses TPU v1-v2 and Alibaba Hanguang 800 in clock speed while offering an equal or greater number of MAC units.
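To make the phrase "bit-serial MAC units for GEMV" concrete, the following is a minimal software sketch, not the paper's hardware design. It assumes unsigned 8-bit operands and models only the idea of consuming one multiplier bit per step; the function names `bit_serial_multiply` and `gemv_bit_serial` are hypothetical, and nothing here reflects the actual PiCaSO-IM datapath or IMAGine's instruction flow.

```python
def bit_serial_multiply(a: int, b: int, bits: int = 8) -> int:
    """Multiply two unsigned `bits`-wide integers, one multiplier bit per step.

    Assumption: unsigned operands; a real PE may use a different number format.
    """
    limit = 1 << bits
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must fit in `bits` unsigned bits")
    product = 0
    for i in range(bits):        # one step per bit of the multiplier b
        if (b >> i) & 1:         # inspect bit i of b
            product += a << i    # add the multiplicand shifted by i positions
    return product


def gemv_bit_serial(matrix, vector, bits: int = 8):
    """Compute y = M @ x, with one hypothetical accumulator per matrix row."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("every row must have the same length as the vector")
    result = []
    for row in matrix:
        acc = 0                  # per-row accumulator (the "AC" in MAC)
        for w, x in zip(row, vector):
            acc += bit_serial_multiply(w, x, bits)
        result.append(acc)
    return result


M = [[1, 2, 3], [4, 5, 6]]
x = [7, 8, 9]
print(gemv_bit_serial(M, x))  # [50, 122]
```

In this sketch the rows are processed one after another; in a PIM array the per-row accumulations would instead run in parallel across processing elements, which is where the MAC count matters.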

Paper Structure (19 sections, 6 figures, 5 tables)


Figures (6)

  • Figure 1: Ideal scaling vs. actual TOPS of RIMA on Stratix 10 GX2800
  • Figure 2: System architecture of IMAGine illustrating the data and instruction flow (a) through the GEMV engine and (b) within GEMV tiles.
  • Figure 3: Architectures of (a) GEMV controller and (b) PiCaSO-IM, the adapted version of PiCaSO-F [picaso2023].
  • Figure 4: Resource usage of IMAGine on representatives of Virtex-7 and Ultrascale+ families utilizing 100% BRAMs as PIM overlays.
  • Figure 5: Avoiding unnecessary hard-block (CMAC) crossing by floorplanning: (a) placement and net connections before floorplanning, (b) floorplan localizing logic and routing, (c) placement and net connections in the final design.
  • ...and 1 more figure