A Latency-Constrained, Gated Recurrent Unit (GRU) Implementation in the Versal AI Engine
M. Sapkas, A. Triossi, M. Zanetti
TL;DR
This work tackles latency-constrained GRU inference in real-time physics readout pipelines by implementing a latency-aware GRU on the Versal AI Engine (AIE) using a hybrid AIE-PL design. It develops a workload-distribution strategy for the AIE's vector processors and compares two approaches to matrix-vector execution, Column-wise Cascade and Row-wise Streams, together with a PL-accelerated aggregation path that reduces latency. The key contributions are a proof-of-concept latency-focused GRU on Versal, the first use of interface tiles as data-path aggregators via a PL kernel, and hardware latency measurements that show favorable scaling with hidden-state size while highlighting resource-driven limits. The results demonstrate the feasibility of deploying adaptable neural networks in real-time particle-physics readout chains, offering a flexible alternative to fixed-function algorithms for low-latency applications.
Abstract
This work explores the use of the AMD Xilinx Versal Adaptable Intelligent Engine (AIE) to accelerate Gated Recurrent Unit (GRU) inference for latency-constrained applications. We present a custom workload distribution framework across the AIE's vector processors and propose a hybrid AIE-Programmable Logic (PL) design to optimize computational efficiency. Our approach highlights the potential of deploying adaptable neural networks in real-time environments such as online preprocessing in the readout chain of a physics experiment, offering a flexible alternative to traditional fixed-function algorithms.
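As a rough illustration of what distributing a matrix-vector product across several processors involves, the following plain-Python sketch contrasts splitting the weight matrix by columns with splitting it by rows. It is not the AIE kernel code: the function names, the `workers` parameter, and the list-of-rows matrix layout are hypothetical, and the correspondence to the Column-wise Cascade and Row-wise Streams designs is an assumption based only on their names.

```python
def matvec(W, x):
    """Reference product y = W x for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def split(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) chunks."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def matvec_column_split(W, x, workers):
    """Each worker owns a block of columns; partial sums are chained."""
    acc = [0.0] * len(W)  # accumulator handed from one worker to the next
    for start, stop in split(len(x), workers):
        for i, row in enumerate(W):
            acc[i] += sum(row[j] * x[j] for j in range(start, stop))
    return acc


def matvec_row_split(W, x, workers):
    """Each worker owns a block of rows and emits its own output slice."""
    out = []
    for start, stop in split(len(W), workers):
        out.extend(matvec(W[start:stop], x))  # slices are independent
    return out
```

In the column split, each worker needs only its slice of the input vector, but the partial sums must be accumulated one after another. In the row split, every worker needs the full input vector, while the output slices are independent and only have to be gathered at the end.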
