A Scalable FPGA Architecture With Adaptive Memory Utilization for GEMM-Based Operations

Anastasios Petropoulos, Theodore Antonakopoulos

TL;DR

The paper tackles the challenge of scalable, high-throughput DNN inference on FPGA platforms by introducing a dynamically configurable accelerator built around a systolic array (SA) processing unit that leverages high-bandwidth memory (HBM) and UltraRAM (URAM). It proposes two PU configurations to balance compute resources and presents a two-phase weight-transfer scheduling strategy to minimize on-chip stalls during GEMM/Conv workloads, enabling efficient multi-PU operation on an Alveo FPGA. The approach yields strong performance on ResNet-18/50, with notable throughput and energy efficiency gains over prior works, and demonstrates the potential to extend the architecture to analog in-memory computing (AIMC) emulation using a Noise Injection Unit (NIU) while preserving the same memory interfaces. This work provides a versatile, transferable FPGA accelerator platform and a practical testbed for investigating AIMC integration in future heterogeneous chips.

Abstract

Deep neural network (DNN) inference relies increasingly on specialized hardware for high computational efficiency. This work introduces a field-programmable gate array (FPGA)-based dynamically configurable accelerator featuring systolic arrays, high-bandwidth memory, and UltraRAMs. We present two processing unit (PU) configurations with different computing capabilities using the same interfaces and peripheral blocks. By instantiating multiple PUs and employing a heuristic weight transfer schedule, the architecture achieves notable throughput efficiency over prior works. Moreover, we outline how the architecture can be extended to emulate analog in-memory computing (AIMC) devices to aid next-generation heterogeneous AIMC chip designs and investigate device-level noise behavior. Overall, this brief presents a versatile DNN inference acceleration architecture adaptable to various models and future FPGA designs.

Paper Structure

This paper contains 9 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: The system architecture of multiple PUs on an Alveo U50 FPGA.
  • Figure 2: The processing unit (PU) architecture: (a) the pre-processing block, (b) the systolic array, and (c) the post-processing block.
  • Figure 3: The transformation of a convolutional layer to matrix multiplication and the computational dataflow pseudocode.
  • Figure 4: Example of two-phase scheduling: (a) baseline, and (b) adaptive.
  • Figure 5: (a) Individual layer latencies of ResNet-50 for both PU configurations. Two-phase method (b) time and (c) memory ratios for ResNet-18 on $\text{PU}_{2\mathrm{x}}$.
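
Figure 3 refers to transforming a convolutional layer into a matrix multiplication. The paper's own dataflow pseudocode is not reproduced on this page, so the following is only a minimal sketch of the standard im2col lowering, assuming that this is the kind of transformation meant. The function names (`im2col`, `matmul`, `conv2d_as_gemm`), the nested-list tensor layout, and the absence of padding are illustrative assumptions, not the authors' implementation.

```python
def im2col(x, kh, kw, stride=1):
    """Unroll a C x H x W input (nested lists) into a matrix.

    Each column holds one receptive field, so the result has
    C*kh*kw rows and out_h*out_w columns. No padding is applied.
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = []
    for ch in range(c):
        for i in range(kh):
            for j in range(kw):
                # One row per (channel, kernel row, kernel column) offset.
                row = [x[ch][oy * stride + i][ox * stride + j]
                       for oy in range(out_h)
                       for ox in range(out_w)]
                cols.append(row)
    return cols, out_h, out_w


def matmul(a, b):
    """Plain GEMM on nested lists: (M x K) times (K x N)."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


def conv2d_as_gemm(x, weights, stride=1):
    """Convolution expressed as a single GEMM.

    weights has shape K x C x kh x kw. Each filter is flattened in the
    same (channel, row, column) order that im2col uses for its rows.
    Returns a K x (out_h*out_w) matrix plus the output height and width.
    """
    kh, kw = len(weights[0][0]), len(weights[0][0][0])
    cols, out_h, out_w = im2col(x, kh, kw, stride)
    w_mat = [[v for ch in f for r in ch for v in r] for f in weights]
    return matmul(w_mat, cols), out_h, out_w
```

In this form the weight matrix is the operand that stays fixed per layer while the unrolled input streams through, which is why a GEMM engine such as a systolic array can serve both fully connected and convolutional layers.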