Mapping Parallel Matrix Multiplication in GotoBLAS2 to the AMD Versal ACAP for Deep Learning

Jie Lei; Enrique S. Quintana-Ortí

TL;DR

The paper tackles efficient GEMM on the Versal ACAP for DL workloads by formulating $C \mathrel{+=} AB$ with $A \in \mathbb{R}^{m\times k}$, $B \in \mathbb{R}^{k\times n}$, and $C \in \mathbb{R}^{m\times n}$, and porting CPU-oriented GEMM techniques to Versal’s heterogeneous memory and AIEs. It introduces a memory-aware design that maps the operands across a five-level memory hierarchy and an architecture-specific UINT8 SIMD micro-kernel that exploits the vector units of the Versal AIEs, while parallelizing the iteration space over multiple AIE tiles. The approach scales up to 32 AIEs, achieving about 31.5 MACs/cycle on a single AIE and roughly 29.8 MACs/cycle with 32 AIEs, but identifies FPGA Ultra RAM bandwidth as a primary bottleneck in this setup. Overall, the work provides a concrete pathway for deploying high-performance GEMM on the Versal ACAP and highlights data movement and memory bandwidth as the critical levers for achieving higher DL throughput on this platform.
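
As a rough reading of the two rates quoted above, and assuming both are per-AIE averages (this summary does not state it explicitly), the parallel efficiency at 32 AIEs would be approximately

$$E_{32} \approx \frac{29.8\ \text{MACs/cycle}}{31.5\ \text{MACs/cycle}} \approx 0.946,$$

that is, about 95% of the single-AIE rate is retained per tile. Under the same assumption, the aggregate throughput would be about $32 \times 29.8 \approx 953.6$ MACs/cycle.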

Abstract

This paper investigates the design of parallel general matrix multiplication (GEMM) for a Versal Adaptive Compute Accelerated Platform (ACAP) equipped with a VC1902 system-on-chip and multiple Artificial Intelligence Engines (AIEs). Our efforts aim to port standard optimization techniques applied in the high-performance realization of GEMM on CPUs to the Versal ACAP. In particular, 1) we address the flexible exploitation of the Versal ACAP multi-level memory hierarchy; 2) we delve into the efficient use of the vector units in the AIE tiles, proposing an architecture-specific micro-kernel for mixed precision arithmetic to address the strong demand for adaptive-precision inference in deep learning; and 3) we introduce a parallel design for GEMM that spans multiple AIE tiles, enhancing the computational throughput. We conduct experimental profiling, with up to 32 AI Engines, that demonstrates the high parallel scalability of the solution.
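
The CPU-style technique that the abstract refers to can be illustrated with a minimal sketch of a GotoBLAS-like blocked GEMM: five loops around a micro-kernel, with blocks of $A$ and $B$ packed into buffers before use. This is not the authors' code. The block sizes `MC`, `NC`, `KC`, the function names, and the plain-Python data layout are assumptions made for illustration; only the $8 \times 8$ micro-tile shape and the UINT8 inputs are taken from the figure captions listed further down. The accumulators are ordinary Python integers, standing in for the wider accumulators that a mixed-precision kernel would use.

```python
# Hypothetical block sizes for illustration only; the paper selects its own
# values according to the capacities of the Versal memory levels.
MC, NC, KC = 16, 16, 8   # block sizes for the outer loops (assumed)
MR, NR = 8, 8            # micro-tile size, matching the 8 x 8 micro-kernel


def micro_kernel(a_panel, b_panel, c, i0, j0, mr, nr):
    """Update an mr x nr micro-tile of C via a sequence of rank-1 updates."""
    for a_col, b_row in zip(a_panel, b_panel):   # one step per index along k
        for i in range(mr):
            for j in range(nr):
                c[i0 + i][j0 + j] += a_col[i] * b_row[j]


def gemm_blocked(a, b, c):
    """Compute C += A * B in place; A and B are lists of rows of UINT8 values."""
    m, k, n = len(a), len(b), len(b[0])
    for jc in range(0, n, NC):                   # column blocks of B and C
        nc = min(NC, n - jc)
        for pc in range(0, k, KC):               # blocks along the k dimension
            kc = min(KC, k - pc)
            # Pack the kc x nc block of B into a buffer, one row per k index.
            bc = [b[pc + p][jc:jc + nc] for p in range(kc)]
            for ic in range(0, m, MC):           # row blocks of A and C
                mc = min(MC, m - ic)
                # Pack the mc x kc block of A, one column per k index.
                ac = [[a[ic + i][pc + p] for i in range(mc)] for p in range(kc)]
                for jr in range(0, nc, NR):      # micro-panels of the B buffer
                    nr = min(NR, nc - jr)
                    b_panel = [row[jr:jr + nr] for row in bc]
                    for ir in range(0, mc, MR):  # micro-panels of the A buffer
                        mr = min(MR, mc - ir)
                        a_panel = [col[ir:ir + mr] for col in ac]
                        micro_kernel(a_panel, b_panel, c,
                                     ic + ir, jc + jr, mr, nr)
    return c
```

In a parallel variant, the iterations of one of the loops around the micro-kernel could be distributed across AIE tiles, since each iteration updates a disjoint micro-tile of $C$; which loop the paper actually parallelizes is not stated in this summary.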

Paper Structure (15 sections, 6 figures, 3 tables)

This paper contains 15 sections, 6 figures, 3 tables.

Introduction
High Performance GEMM on Conventional Architectures
Architecture of the Versal ACAP
Customizing GEMM for the Versal ACAP
Distributing the operands across the memory hierarchy
SIMD Micro-kernel for the AIE tile
Setting the cache configuration parameters
Parallelization of GEMM for the AIE tile grid
Communication protocols
Performance Analysis
Transfer costs for the micro-kernel
Arithmetic cost for the micro-kernel
Sustained performance
Scalability of the parallel design
Conclusions

Figures (6)

Figure 1: Baseline high performance algorithm for gemm. Top-Left: Blocked algorithm; Middle-Left: Micro-kernel; Bottom-Left: Packing of input matrix operands. Right: Data transfers across the memory hierarchy.
Figure 2: Block diagram of the Versal AI Core.
Figure 3: Mapping of gemm operands to the Versal ACAP memory hierarchy.
Figure 4: Simplified version of the $8 \times 8$, UINT8 micro-kernel for the AIE tile.
Figure 5: Simplified parallel implementation of gemm for the Versal ACAP.
...and 1 more figure
