A Scalable FPGA Architecture With Adaptive Memory Utilization for GEMM-Based Operations
Anastasios Petropoulos, Theodore Antonakopoulos
TL;DR
The paper tackles the challenge of scalable, high-throughput DNN inference on FPGA platforms by introducing a dynamically configurable accelerator built around a systolic array (SA) processing unit (PU) that leverages high-bandwidth memory (HBM) and UltraRAM (URAM). It proposes two PU configurations to balance compute resources and presents a two-phase weight-transfer scheduling strategy to minimize on-chip stalls during GEMM/Conv workloads, enabling efficient multi-PU operation on an Alveo FPGA. The approach yields strong performance on ResNet-18/50, with notable throughput and energy efficiency gains over prior work, and demonstrates the potential to extend the architecture to analog in-memory computing (AIMC) emulation using a Noise Injection Unit (NIU) while preserving the same memory interfaces. This work provides a versatile, transferable FPGA accelerator platform and a practical testbed for investigating AIMC integration in future heterogeneous chips.
Abstract
Deep neural network (DNN) inference relies increasingly on specialized hardware for high computational efficiency. This work introduces a field-programmable gate array (FPGA)-based dynamically configurable accelerator featuring systolic arrays, high-bandwidth memory, and UltraRAMs. We present two processing unit (PU) configurations with different computing capabilities using the same interfaces and peripheral blocks. By instantiating multiple PUs and employing a heuristic weight transfer schedule, the architecture achieves notably higher throughput efficiency than prior works. Moreover, we outline how the architecture can be extended to emulate analog in-memory computing (AIMC) devices, both to aid next-generation heterogeneous AIMC chip designs and to investigate device-level noise behavior. Overall, this brief presents a versatile DNN inference acceleration architecture adaptable to various models and future FPGA designs.
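To make the GEMM and noise-emulation ideas in the abstract concrete, the following is a minimal software sketch, not the authors' hardware design. It computes a matrix product tile by tile, loosely mirroring how a systolic array works through one weight tile at a time, and can optionally add Gaussian noise to each tile's partial sum as a stand-in for AIMC device noise. The function name `tiled_gemm`, the `tile` size, and the additive Gaussian noise model with `noise_sigma` are all assumptions made for illustration; the paper's actual tiling, scheduling, and noise model may differ.

```python
import random


def tiled_gemm(a, b, tile=2, noise_sigma=0.0, seed=0):
    """Compute C = A x B tile by tile (illustrative sketch only).

    a: M x K matrix as a list of rows, b: K x N matrix as a list of rows.
    tile: hypothetical edge length of one square tile.
    noise_sigma: hypothetical std. dev. of Gaussian noise added to each
                 tile's partial sum (0.0 disables noise).
    """
    if not a or not b or len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    rng = random.Random(seed)  # seeded so noisy runs are reproducible
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]

    for i0 in range(0, m, tile):          # output row tiles
        for j0 in range(0, n, tile):      # output column tiles
            for k0 in range(0, k, tile):  # reduction tiles (one weight tile each)
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        # Partial dot product over the current weight tile.
                        acc = 0.0
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        # Assumed noise model: additive Gaussian per partial sum.
                        if noise_sigma > 0.0:
                            acc += rng.gauss(0.0, noise_sigma)
                        c[i][j] += acc
    return c


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(tiled_gemm(A, B, tile=2))  # [[58.0, 64.0], [139.0, 154.0]]
    print(tiled_gemm(A, B, tile=2, noise_sigma=0.01))  # slightly perturbed
```

With `noise_sigma=0.0` the result matches an ordinary matrix product, so the same routine can serve as both the exact reference and the noisy emulation path, in the spirit of keeping identical interfaces for both modes.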
