WideSA: A High Array Utilization Mapping Scheme for Uniform Recurrences on the Versal ACAP Architecture

Tuo Dai, Bizhao Shi, Guojie Luo

TL;DR

This work tackles the efficient mapping of uniform recurrences to the Versal ACAP to maximize AIE array utilization. It introduces WideSA, a polyhedral-model-based mapping scheme that generates systolic-like layouts on the AIE array and uses a routing-aware PLIO assignment. An automatic framework produces executable code for heterogeneous backends (AIE, PL, host), enabling end-to-end compilation. On the VCK5000, WideSA achieves up to $4.15$ TOPS and up to $1.11\times$ improvement over state-of-the-art accelerators, with high AIE utilization, demonstrating practical impact for uniform recurrence workloads.

Abstract

The Versal Adaptive Compute Acceleration Platform (ACAP) is a new architecture that combines AI Engines (AIEs) with reconfigurable fabric. This architecture offers significant acceleration potential for uniform recurrences in domains such as deep learning, high-performance computation, and signal processing. However, efficiently mapping these computations onto the Versal ACAP architecture while achieving high AIE utilization is challenging. To address this issue, we propose a mapping scheme called WideSA, which accelerates uniform recurrences on the Versal ACAP architecture by leveraging the features of both the hardware and the computations. Considering the array architecture of the AIEs, our approach uses space-time transformations based on the polyhedral model to generate legal, optimized systolic array mappings. In addition, we develop a routing-aware PLIO assignment algorithm tailored for communication on the AIE array, which aims to ensure successful compilation while maximizing array utilization. Furthermore, we introduce an automatic mapping framework that generates the corresponding executable code for uniform recurrences, comprising the AIE kernel program, the programmable logic bitstreams, and the host program. The experimental results validate the effectiveness of our mapping scheme. Specifically, when applying our scheme to matrix multiplication on the VCK5000 board, we achieve a throughput of 4.15 TOPS on the float data type, which is $1.11\times$ higher than the state-of-the-art accelerator on the Versal ACAP architecture.
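
To make the idea of a space-time transformation concrete, the sketch below shows the textbook mapping of matrix multiplication, a uniform recurrence over the iteration space (i, j, k), onto a 2D array of processing elements. It is an illustrative assumption, not the WideSA implementation: the names `DEPS`, `is_legal_schedule`, `space_time`, and `systolic_matmul` are hypothetical, the space mapping (i, j) and the schedule (1, 1, 1) are one common choice among many, and nothing here models AIE kernels, PLIO ports, or routing.

```python
from itertools import product

# Dependence vectors of the matmul recurrence over (i, j, k) (textbook form):
#   A[i][k] is reused along j, B[k][j] along i, C[i][j] accumulates along k.
DEPS = [(0, 1, 0), (1, 0, 0), (0, 0, 1)]


def is_legal_schedule(schedule, deps=DEPS):
    """A linear schedule is legal if every dependence advances in time."""
    return all(sum(s * d for s, d in zip(schedule, dep)) > 0 for dep in deps)


def space_time(i, j, k, schedule=(1, 1, 1)):
    """Map an iteration to (processing element, time step).

    Space: project out k, so PE (i, j) owns output C[i][j].
    Time:  dot product of the schedule vector with (i, j, k).
    """
    pe = (i, j)
    t = schedule[0] * i + schedule[1] * j + schedule[2] * k
    return pe, t


def systolic_matmul(A, B, schedule=(1, 1, 1)):
    """Execute C = A @ B in the order given by the space-time mapping."""
    n, m, p = len(A), len(B), len(B[0])
    assert is_legal_schedule(schedule), "schedule violates a dependence"

    # Place every iteration; each PE may run at most one iteration per step.
    events = {}
    for i, j, k in product(range(n), range(p), range(m)):
        pe, t = space_time(i, j, k, schedule)
        assert (pe, t) not in events, "two iterations collide on one PE"
        events[(pe, t)] = k

    # Replay the iterations in time order, as the array would execute them.
    C = [[0] * p for _ in range(n)]
    for pe, t in sorted(events, key=lambda e: e[1]):
        i, j = pe
        k = events[(pe, t)]
        C[i][j] += A[i][k] * B[k][j]
    return C


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The legality check is the key step: a schedule such as (1, 1, 0) is rejected because it would place successive accumulations into the same C[i][j] at the same time step.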

Paper Structure

This paper contains 22 sections, 3 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Versal ACAP Architecture
  • Figure 2: Kernel Scope Demarcation
  • Figure 3: Polyhedral Model-Based Systolic Mapping
  • Figure 4: Communication Methods for PLIO Ports Utilization Reduction
  • Figure 5: Overview of WideSA Automatic Framework
  • ...and 1 more figure