Striking the Balance: GEMM Performance Optimization Across Generations of Ryzen AI NPUs

Endri Taka, Andre Roesti, Joseph Melber, Pranathi Vasireddy, Kristof Denolf, Diana Marculescu

TL;DR

This work presents a unified methodology for optimizing GEMM workloads on AMD Ryzen AI NPUs across two generations, XDNA and XDNA2. It couples analytical modeling with hardware profiling to find the point at which compute and memory are balanced. It introduces a multi-level tiling GEMM design that preserves regular DRAM layouts and relies on on-the-fly data transformations and explicit data movement to maximize throughput. End-to-end evaluations on two mini PCs demonstrate state-of-the-art performance: up to 6.76 TOPS (int8) and 3.14 TOPS (bf16) on XDNA, and up to 38.05 TOPS (int8) and 14.71 TOPS (bf16) on XDNA2. The study highlights the critical role of data movement, buffering strategies, and kernel balance in achieving high performance, and offers a framework that generalizes to future NPU generations and GEMM-related workloads.
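
To make the idea of a compute-memory balance point concrete, the following is a minimal sketch of a roofline-style check for a single GEMM tile. It is not the paper's analytical model: the function names, the assumption that each A and B tile is loaded once and the C tile is written once, and all numeric inputs are hypothetical and chosen only for illustration.

```python
def tile_arithmetic_intensity(m, k, n, in_bytes, out_bytes):
    """Operations per byte moved for one m x k x n GEMM tile.

    Assumption (illustrative only): the A tile (m x k) and B tile (k x n)
    are each loaded once, and the C tile (m x n) is written once.
    """
    ops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n) * in_bytes + m * n * out_bytes
    return ops / bytes_moved


def is_compute_bound(intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """True when the tile's intensity meets or exceeds the machine balance.

    Machine balance = peak compute rate / memory bandwidth (ops per byte).
    Both rates are hypothetical inputs supplied by the caller.
    """
    machine_balance = peak_ops_per_s / bandwidth_bytes_per_s
    return intensity >= machine_balance


if __name__ == "__main__":
    # Hypothetical tile and hardware numbers, not taken from the paper.
    ai = tile_arithmetic_intensity(64, 64, 64, in_bytes=1, out_bytes=1)
    print(f"arithmetic intensity: {ai:.2f} ops/byte")
    print("compute bound:", is_compute_bound(ai, 4e12, 1e11))
```

Under this simplified model, growing the tile raises the arithmetic intensity, while the tile that fits in local memory caps how far it can grow; the balance point is where the intensity just reaches the machine balance.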

Abstract

The high computational and memory demands of modern deep learning (DL) workloads have led to the development of specialized hardware devices from cloud to edge, such as AMD's Ryzen AI XDNA NPUs. Optimizing general matrix multiplication (GEMM) algorithms for these architectures is critical for improving DL workload performance. To this end, this paper presents a common systematic methodology to optimize GEMM workloads across the two current NPU generations, namely XDNA and XDNA2. Our implementations exploit the unique architectural features of AMD's NPUs and address key performance bottlenecks at the system level. End-to-end performance evaluation across various GEMM sizes demonstrates state-of-the-art throughput of up to 6.76 TOPS (XDNA) and 38.05 TOPS (XDNA2) for 8-bit integer (int8) precision. Similarly, for brain floating-point (bf16) precision, our GEMM implementations attain up to 3.14 TOPS (XDNA) and 14.71 TOPS (XDNA2). This work provides significant insights into key performance aspects of optimizing GEMM workloads on Ryzen AI NPUs.
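
For readers unfamiliar with the TOPS metric, the sketch below shows the usual way to convert a GEMM size and a measured runtime into throughput, counting 2·M·K·N operations per multiplication. The helper name and the example size and runtime are hypothetical; the paper's exact measurement procedure is not described in this excerpt.

```python
def gemm_tops(m, k, n, seconds):
    """Throughput in tera-operations per second for an M x K x N GEMM.

    Counts one multiply and one add per inner-product term (2*M*K*N ops).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2 * m * k * n / seconds / 1e12


if __name__ == "__main__":
    # Hypothetical size and runtime, not a measurement from the paper.
    print(f"{gemm_tops(4096, 4096, 4096, 0.02):.2f} TOPS")
```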

Paper Structure

This paper contains 19 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Architecture of Ryzen AI NPUs.
  • Figure 2: Data movement across the memory hierarchy in Ryzen AI NPUs: input buffer A (a) and output buffer C (b).
  • Figure 3: Proposed GEMM multi-level tiling scheme (a), and GEMM mapping strategy on XDNA (b) and XDNA2 (c).
  • Figure 4: GEMM performance while varying parameter $k_\text{mt}$ for bf16-bf16 96$\times$56$\times$96 (a) and int8-int16 128$\times$72$\times$112 (b).
  • Figure 5: Roofline GEMM performance sweeps for various matrix sizes on XDNA.
  • ...and 1 more figure