AMD Versal AI-Engines for fixed latency environments

Ioannis Xiotidis, Noah Clarke Hall, Tianjia Du, Nikos Konstantinidis, David Miller

Abstract

Complex, high-throughput data acquisition and processing systems, such as those used in high-energy physics experiments, are increasingly moving sophisticated pattern recognition and data compression algorithms closer to the sensors themselves. To meet these needs, programmable device manufacturers offer multi-silicon die packages that commonly include dedicated co-processors within the same package. We present a technical study of a new family of such co-processors from AMD Xilinx, the Adaptive Intelligence (AI) Engine, or AIE, as part of the Versal architecture. Specifically, we focus on the deployment capabilities of AIEs in fixed latency environments such as those typically found in colliding beam experiments like those at the Large Hadron Collider. We evaluate the performance of a vectorised implementation of both a Boosted Decision Tree (BDT) and a Convolutional Neural Network (CNN), thereby demonstrating the feasibility of deploying AIEs for ML applications in such environments and their use as possible alternatives to traditional programmable logic-based implementations.

Paper Structure (9 sections, 4 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: The ATLAS TDAQ system for the HL-LHC upgrade. The lilac boxes show the Level-0 hardware trigger, implemented in custom hardware boards. The green and yellow boxes show the data streaming infrastructure, based on custom hardware cards and commodity servers. The salmon box shows the Event Filter processing farm, based on commodity servers and potentially GPUs or FPGAs. [atlas_phase_II]
  • Figure 2: Sub-components of an AI Engine tile. The grey box indicates the 32 kB memory, the input/output components (streaming or DMA based) are on the left, and the light blue box indicates the two processing units (vector and scalar). [AIE_1]
  • Figure 3: Parallelisation of BDT trees mapped onto the AI Engine vector processor unit.
  • Figure 4: Parallelisation of the 2D convolution kernel for AI Engine tiles, indicating the vector processor instructions used.
  • Figure 5: Comparison of the AI Engine emulation and XGBoost software results for 100 randomly generated 16-feature inputs.
  • ...and 4 more figures