Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs

Endri Taka; Dimitrios Gourounas; Andreas Gerstlauer; Diana Marculescu; Aman Arora

TL;DR

This paper addresses GEMM acceleration on two AI-optimized FPGAs with distinct architectures: Versal ACAP (out-of-fabric AI Engines, AIEs) and Stratix 10 NX (in-fabric Tensor Blocks, TBs). It proposes systematic, architecture-aware frameworks for GEMM design that cover multi-level tiling, AIE/TB mapping, memory strategies, and RTL automation, supported by extensive design-space exploration. The results show up to $77$ TOPs (int8) on Versal and $68$ TOPs (int8) on Stratix, with energy efficiencies up to $0.94$ and $1.35$ TOPs/W respectively, illustrating strong on-chip data reuse and platform-specific bottlenecks. The study offers actionable guidelines on memory mapping, dataflow, and programmability trade-offs for deploying GEMM-based DL workloads on both platforms.

Abstract

FPGAs are a promising platform for accelerating Deep Learning (DL) applications, due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more efficiently support the computational demands of DL workloads. However, the two most prominent AI-optimized FPGAs, i.e., AMD/Xilinx Versal ACAP and Intel Stratix 10 NX, employ significantly different architectural approaches. This paper presents novel systematic frameworks to optimize the performance of General Matrix Multiplication (GEMM), a fundamental operation in DL workloads, by exploiting the unique and distinct architectural characteristics of each FPGA. Our evaluation on GEMM workloads for int8 precision shows up to 77 and 68 TOPs (int8) throughput, with up to 0.94 and 1.35 TOPs/W energy efficiency for Versal VC1902 and Stratix 10 NX, respectively. This work provides insights and guidelines for optimizing GEMM-based applications on both platforms, while also delving into their programmability trade-offs and associated challenges.
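To make the idea of multi-level tiling concrete, the following is a minimal sketch of a two-level tiled GEMM in plain Python. It is not the paper's tiling scheme: the function names (`tiles`, `tiled_gemm`, `naive_gemm`) and the tile sizes `outer` and `inner` are hypothetical, chosen only to show how an outer tile (loosely, a block staged in on-chip memory) is subdivided into inner tiles (loosely, the work handed to one compute unit) while producing the same result as an untiled product.

```python
def tiles(lo, hi, step):
    """Yield (start, end) index ranges covering [lo, hi) in chunks of `step`."""
    for start in range(lo, hi, step):
        yield start, min(start + step, hi)


def tiled_gemm(a, b, outer=4, inner=2):
    """Compute C = A x B with two tiling levels (illustrative tile sizes)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    # Level 1: outer tiles over the M, N, and K dimensions.
    for i0, i0e in tiles(0, m, outer):
        for j0, j0e in tiles(0, n, outer):
            for k0, k0e in tiles(0, k, outer):
                # Level 2: inner tiles inside the current outer tile.
                for i1, i1e in tiles(i0, i0e, inner):
                    for j1, j1e in tiles(j0, j0e, inner):
                        for k1, k1e in tiles(k0, k0e, inner):
                            # Innermost multiply-accumulate on one small tile.
                            for i in range(i1, i1e):
                                for j in range(j1, j1e):
                                    acc = c[i][j]
                                    for kk in range(k1, k1e):
                                        acc += a[i][kk] * b[kk][j]
                                    c[i][j] = acc
    return c


def naive_gemm(a, b):
    """Reference untiled product, used only to check the tiled version."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
```

Because the tiling only reorders the accumulation over integer operands, `tiled_gemm(a, b)` should match `naive_gemm(a, b)` for any compatible shapes, including sizes that are not multiples of the tile sizes.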

Paper Structure (35 sections, 10 equations, 8 figures, 4 tables)

Introduction
Related Work
FPGA Architectures Overview
Versal ACAP Architecture
Stratix 10 NX Architecture
GEMM Design & Optimization
GEMM Implementation on Versal ACAP
GEMM Multi-Level Tiling Scheme
GEMM Mapping on AIE Array
PL Implementation
Memory Optimization Strategy
GEMM Implementation on Stratix 10 NX
TB Layout
Parameter TBlen
Parameter Kp
...and 20 more sections

Figures (8)

Figure 1: Versal ACAP architecture.
Figure 2: Architecture of Stratix 10 NX Tensor Blocks.
Figure 3: Multi-level tiling scheme for GEMM on Versal ACAP.
Figure 4: GEMM accelerator design on Versal AIE and PL.
Figure 5: BRAM configurations example and proposed modeling.
...and 3 more figures
