Optimizing GEMM for Energy and Performance on Versal ACAP Architectures

Ilias Papalamprou; Dimosthenis Masouros; Ioannis Loudaros; Francky Catthoor; Dimitrios Soudris

TL;DR

The paper addresses the performance and energy-efficiency bottleneck of GEMM on AMD Versal ACAP with an automated, ML-guided framework that maps GEMM across AIEs, PL, and DDR. The authors build an on-board dataset of approximately $6000$ hardware designs across $18$ GEMM workloads, train Gradient Boosted Decision Tree models to predict latency, power, and PL resource utilization, and use these models to drive a design space exploration (DSE) that yields Pareto-optimal mappings. The offline phase builds the predictors; the online phase performs rapid DSE, with a genetic algorithm (GA) accelerating the search. The resulting mappings achieve geomean gains of $1.23\times$ in throughput and $1.25\times$ in energy efficiency over state-of-the-art methods, with improvements of up to $2.5\times$ and $2.7\times$, respectively. The results also show strong Pareto-front quality and competitive performance relative to embedded GPUs.
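To make the notion of a Pareto-optimal mapping concrete, the following is a minimal sketch of Pareto filtering over predicted latency and power. It is not the authors' implementation: the `Candidate` type, the mapping labels, and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical GEMM mapping with model-predicted metrics."""
    name: str           # illustrative label, not from the paper
    latency_ms: float   # predicted latency (lower is better)
    power_w: float      # predicted power (lower is better)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.power_w <= b.power_w
    strictly_better = a.latency_ms < b.latency_ms or a.power_w < b.power_w
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Made-up predictions for five candidate mappings.
candidates = [
    Candidate("A", 1.0, 30.0),
    Candidate("B", 1.5, 22.0),
    Candidate("C", 1.6, 25.0),  # dominated by B
    Candidate("D", 2.4, 18.0),
    Candidate("E", 2.5, 31.0),  # dominated by A
]

front = pareto_front(candidates)
print([c.name for c in front])  # ['A', 'B', 'D']
```

The quadratic scan is adequate for a few thousand candidates; a search procedure such as a GA would typically apply a filter like this to each generation rather than to the full design space.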

Abstract

General Matrix Multiplication (GEMM) is a fundamental operation in many scientific workloads, signal processing, and particularly deep learning. It is often a bottleneck for performance and energy efficiency, especially in edge environments with tight resource and power constraints. AMD's Versal ACAP offers heterogeneous components (AIEs, PL, PS) that can address these challenges, but mapping GEMM across them is complex, with prior works largely overlooking energy-performance trade-offs. In this paper, we propose an automated framework for Versal ACAP that generates GEMM mappings optimized for either performance or energy efficiency. Unlike prior analytical approaches, our method leverages a Machine Learning (ML) model, trained on approximately 6000 on-board experiments of different GEMM mappings, to guide Design Space Exploration, yielding more efficient designs. Evaluation on the Versal VCK190 shows geomean improvements of 1.23x (up to 2.5x) in throughput and 1.25x (up to 2.7x) in energy efficiency over state-of-the-art frameworks.
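The headline numbers are geometric means of per-workload ratios. As a brief sketch of how such a figure is computed, the example below uses made-up speedups, not values from the paper; the `geomean` helper is likewise illustrative.

```python
import math
from typing import Sequence


def geomean(ratios: Sequence[float]) -> float:
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty sequence of positive numbers")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-workload throughput ratios (new framework / baseline).
speedups = [1.0, 2.0, 4.0]

gm = geomean(speedups)
am = sum(speedups) / len(speedups)
print(f"geomean = {gm:.3f}, arithmetic mean = {am:.3f}")
```

The geometric mean is the usual choice for ratios because a single large speedup inflates it less than it inflates the arithmetic mean.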
