IMAGine: An In-Memory Accelerated GEMV Engine Overlay
MD Arafat Kabir, Tendayi Kamucheka, Nathaniel Fredricks, Joel Mandebi, Jason Bakos, Miaoqing Huang, David Andrews
TL;DR
IMAGine tackles the memory bottleneck in BRAM-dense FPGA workloads by delivering an in-memory GEMV engine overlay that clocks at BRAM Fmax and scales linearly with BRAM capacity. The architecture uses a 2D GEMV tile array with PiCaSO-IM PIM modules to maximize BRAM-speed operation, achieving 64K PEs on an AMD Alveo U55 at 737 MHz and outperforming prior PIM GEMV designs and competing ASICs in clock rate while maintaining high MAC density. Comparative analyses show IMAGine runs 2.65x-3.2x faster in system frequency than existing FPGA PIM GEMV engines, and its linear BRAM scalability enables full BRAM utilization without DSPs. The work challenges the notion that FPGA overlays cannot reach BRAM Fmax, delivers open-source implementations, and outlines future MLIR-based compiler support for hardware/software co-design.
Abstract
Processor-in-Memory (PIM) overlays and redesigned reconfigurable tile fabrics have been proposed to eliminate the von Neumann bottleneck and enable processing performance to scale with BRAM capacity. The performance of these FPGA-based PIM architectures has been limited by a reduction in the BRAMs' maximum clock frequency and by less-than-ideal scaling of processing elements with increased BRAM capacity. This paper presents IMAGine, an In-Memory Accelerated GEMV engine: a PIM-array accelerator that clocks at the maximum frequency of the BRAM and scales to 100% of the available BRAMs. Comparative analyses are presented showing execution speedups over existing PIM-based GEMV engines on FPGAs and a 2.65x-3.2x faster clock. An AMD Alveo U55 implementation is presented that achieves a system clock speed of 737 MHz and provides 64K bit-serial multiply-accumulate (MAC) units for GEMV operation. This establishes IMAGine as the fastest PIM-based GEMV overlay, outperforming even the custom PIM-based FPGA accelerators reported to date. Additionally, it surpasses TPU v1-v2 and Alibaba Hanguang 800 in clock speed while offering an equal or greater number of MAC units.
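To make the phrase "bit-serial MAC units for GEMV" concrete, the following is a minimal software sketch of the arithmetic only. It is not IMAGine's PE microarchitecture or its mapping of the matrix onto the tile array, neither of which is described in this section. The function names `bit_serial_mac` and `gemv`, the 8-bit operand width, and the unsigned shift-and-add scheme are all assumptions made for illustration; each output row is treated as if it were handled by one independent MAC unit.

```python
def bit_serial_mac(acc, a, b, bits=8):
    """Add a*b to acc, consuming one bit of b per step.

    Assumes unsigned operands with 0 <= b < 2**bits (illustrative only).
    """
    for i in range(bits):
        if (b >> i) & 1:      # examine bit i of the multiplier
            acc += a << i     # add the correspondingly shifted multiplicand
    return acc


def gemv(matrix, vector, bits=8):
    """Compute y = M x, one hypothetical bit-serial MAC unit per output row."""
    result = []
    for row in matrix:
        acc = 0
        for a, x in zip(row, vector):
            acc = bit_serial_mac(acc, a, x, bits)
        result.append(acc)
    return result


M = [[1, 2, 3],
     [4, 5, 6]]
x = [7, 8, 9]
y = gemv(M, x)
print(y)  # [50, 122]
```

In this sketch a single MAC costs `bits` sequential steps. That is the usual trade-off of bit-serial designs: each unit is slow per operation, but the units are small enough to instantiate in very large numbers.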
