WideSA: A High Array Utilization Mapping Scheme for Uniform Recurrences on the Versal ACAP Architecture

Tuo Dai; Bizhao Shi; Guojie Luo

TL;DR

This work tackles the efficient mapping of uniform recurrences to the Versal ACAP to maximize AIE array utilization. It introduces WideSA, a polyhedral-model-based mapping scheme that generates systolic-like layouts on the AIE array and uses a routing-aware PLIO assignment. An automatic framework produces executable code for heterogeneous backends (AIE, PL, host), enabling end-to-end compilation. On the VCK5000, WideSA achieves up to $4.15$ TOPS and up to $1.11\times$ improvement over state-of-the-art accelerators, with high AIE utilization, demonstrating practical impact for uniform recurrence workloads.

Abstract

The Versal Adaptive Compute Acceleration Platform (ACAP) is a new architecture that combines AI Engines (AIEs) with reconfigurable fabric. This architecture offers significant acceleration potential for uniform recurrences in various domains, such as deep learning, high-performance computing, and signal processing. However, efficiently mapping these computations onto the Versal ACAP architecture while achieving high AIE utilization is challenging. To address this issue, we propose a mapping scheme called WideSA, which accelerates uniform recurrences on the Versal ACAP architecture by leveraging the features of both the hardware and the computations. Considering the array architecture of the AIEs, our approach uses space-time transformations based on the polyhedral model to generate legal, optimized systolic array mappings. We also develop a routing-aware PLIO assignment algorithm tailored for communication on the AIE array, which aims at successful compilation while maximizing array utilization. Furthermore, we introduce an automatic mapping framework that generates the corresponding executable code for uniform recurrences, comprising the AIE kernel program, programmable logic bitstreams, and the host program. The experimental results validate the effectiveness of our mapping scheme. Specifically, when applying our scheme to matrix multiplication on the VCK5000 board, we achieve a throughput of 4.15 TOPS on the float data type, which is 1.11$\times$ higher than that of the state-of-the-art accelerator on the Versal ACAP architecture.
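To make the idea of a space-time transformation concrete, the following minimal sketch treats matrix multiplication as a uniform recurrence and executes it in the order given by a textbook systolic mapping. It is an illustration only and not the WideSA implementation: the function names `is_legal_schedule` and `systolic_matmul`, the choice of processing element (i, j), and the schedule t = i + j + k are assumptions made for this example.

```python
from itertools import product


def is_legal_schedule(schedule, dependences):
    """A linear schedule is legal if every dependence advances time by at least one step."""
    return all(
        sum(s * d for s, d in zip(schedule, dep)) >= 1
        for dep in dependences
    )


def systolic_matmul(A, B):
    """Compute C = A @ B in the order implied by a space-time mapping.

    Assumed mapping (illustrative): iteration (i, j, k) runs on
    processing element (i, j) at time step t = i + j + k.
    Returns the result matrix and the number of time steps used.
    """
    n, m, p = len(A), len(B), len(B[0])

    schedule = (1, 1, 1)  # t = i + j + k
    # Uniform dependence vectors of the recurrence:
    # (0, 0, 1): C[i][j] accumulates along k
    # (0, 1, 0): A[i][k] is reused along j
    # (1, 0, 0): B[k][j] is reused along i
    dependences = [(0, 0, 1), (0, 1, 0), (1, 0, 0)]
    assert is_legal_schedule(schedule, dependences)

    # Group all iterations by the time step at which they execute.
    steps = {}
    for i, j, k in product(range(n), range(p), range(m)):
        steps.setdefault(i + j + k, []).append((i, j, k))

    C = [[0] * p for _ in range(n)]
    for t in sorted(steps):
        # Each processing element (i, j) performs at most one update per step.
        assert len({(i, j) for i, j, _ in steps[t]}) == len(steps[t])
        for i, j, k in steps[t]:
            C[i][j] += A[i][k] * B[k][j]
    return C, len(steps)


if __name__ == "__main__":
    result, num_steps = systolic_matmul([[1, 2, 3], [4, 5, 6]],
                                        [[7, 8], [9, 10], [11, 12]])
    print(result, num_steps)
```

The legality check mirrors the usual condition that a schedule must respect every dependence of the recurrence. A real mapping onto the AIE array would additionally have to handle array partitioning, latency hiding, and PLIO routing, which this sketch does not model.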

Paper Structure (22 sections, 3 equations, 6 figures, 4 tables, 1 algorithm)


Introduction
Background
Versal ACAP Architecture and Workflow
Hardware Features
Software Programming Model
Uniform Recurrences and Systolic Array Mapping
Systolic Mapping Scheme on ACAP
Kernel Scope Demarcation
Systolic Mapping Generation
Space-time Transformation
Array Partition
Latency Hiding
Multiple Threading
Placement and Routing Constraints Construction
Graph Builder
...and 7 more sections

Figures (6)

Figure 1: Versal ACAP Architecture
Figure 2: Kernel Scope Demarcation
Figure 3: Polyhedral Model-Based Systolic Mapping
Figure 4: Communication Methods for PLIO Ports Utilization Reduction
Figure 5: Overview of WideSA Automatic Framework
...and 1 more figure
