AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines

Dimitrios Danopoulos, Enrico Lupi, Chang Sun, Sebastian Dittmeier, Michael Kagan, Vladimir Loncar, Maurizio Pierini

TL;DR

AIE4ML is an end-to-end compiler that transforms quantized neural networks into optimized, fully on-chip firmware for AMD's AIE-ML hardware, addressing the lack of multi-layer, on-chip execution on 2D AIE fabrics. The framework combines a dedicated IR, a pass-based pipeline, and a graph-placement algorithm to achieve near-peak, distributed performance, with memory tiles handling all inter-layer data movement. It demonstrates high single-kernel efficiency, multi-layer throughput that scales to 296 of the device's 304 AIE tiles, and competitive cross-architecture performance, including 82.2% of INT8 peak under GEMM workloads and 113.4 TOPS on the AIE-ML device for representative workloads. The work targets practical, low-latency AI inference for trigger and real-time systems, validates end-to-end feasibility, and outlines a path toward broader operator support and future AIE generations.

Abstract

Efficient AI inference on AMD's Versal AI Engine (AIE) is challenging due to tightly coupled VLIW execution, explicit datapaths, and local memory management. Prior work focused on first-generation AIE kernel optimizations, without tackling full neural network execution across the 2D array. In this work, we present AIE4ML, the first comprehensive framework for automatically converting AI models into optimized firmware targeting AIE-ML generation devices, with forward compatibility for the newer AIE-MLv2 architecture. At the single-kernel level, we attain performance close to the architectural peak. At the graph and system levels, we provide a structured parallelization method that scales across the 2D AIE-ML fabric and exploits its dedicated memory tiles to stay entirely on-chip throughout model execution. As a demonstration, we designed a generalized and highly efficient linear-layer implementation with intrinsic support for fused bias addition and ReLU activation. Because the framework must generate multi-layer implementations, our approach systematically derives deterministic, compact, and topology-optimized placements tailored to the physical 2D grid of the device through a novel graph placement and search algorithm. Finally, the framework accepts quantized models imported from high-level tools such as hls4ml or PyTorch while preserving bit-exactness. In layer scaling benchmarks, we achieve up to 98.6% efficiency relative to the single-kernel baseline, utilizing 296 of the device's 304 AIE tiles (97.4%) with entirely on-chip data movement. With evaluations across real-world model topologies, we demonstrate that AIE4ML delivers GPU-class throughput under microsecond latency constraints, making it a practical companion for ultra-low-latency environments such as trigger systems in particle physics experiments.

Paper Structure

This paper contains 16 sections, 4 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overall hardware design of AIE4ML. Left: Blocked layer kernel using the aie::mmul API and the 2$\times$2 accumulator scheme, shown here with an illustrative $\langle 2,4,4\rangle$ tile size. Middle: Layer-level scaling using cascade rows and input broadcasting to distribute inputs and combine partial sums across the AI engine array. Right: Cross-layer pipelining through memory tiles, which provide zero-padding, re-tiling, and activation distribution to enable fully on-chip multi-layer execution.
  • Figure 2: Overview of the AIE4ML compilation pipeline. A high-level network is parsed with hls4ml, lowered into an AIE-specific intermediate representation (IR), processed by a sequence of passes that resolve quantization, tiling, packing, graph connectivity, and placement, and finally emitted as an optimized AIE project ready for build or simulation.
  • Figure 3: Automatic placement based on the branch-and-bound (B&B) algorithm (a) compared with two greedy baselines (b,c) on a 38 $\times$ 8 AIE array (start at $(0,0)$, $\lambda{=}1.0$, $\mu{=}0.05$). B&B yields shorter inter-layer connections and lower-row bias.
  • Figure 4: Scaling of a single linear layer (including bias and ReLU) across an increasing number of AIE tiles. Red dashed lines mark the maximum utilization of 296 of 304 tiles (97.4%).
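
The blocked linear-layer kernel described in the Figure 1 caption can be illustrated with a small software model. The sketch below is a pure-Python approximation, not the AIE4ML kernel. It assumes integer inputs, uses the illustrative $\langle 2,4,4\rangle$ tile size from the caption, and fuses bias addition and ReLU into the tile loop. The function names `blocked_linear_relu` and `reference_linear_relu` are hypothetical, and the sketch does not model the `aie::mmul` API, the 2$\times$2 accumulator scheme, cascade rows, or memory tiles.

```python
# Illustrative software model only; names are hypothetical and not part of AIE4ML.
# TM, TK, TN mirror the illustrative <2,4,4> tile size from the Figure 1 caption.
TM, TK, TN = 2, 4, 4


def blocked_linear_relu(x, w, bias):
    """Compute relu(x @ w + bias) one TM x TN output tile at a time.

    x has shape (M, K), w has shape (K, N), bias has length N.
    Assumption: M, K, N are multiples of TM, TK, TN (no zero-padding here).
    """
    m, k, n = len(x), len(w), len(w[0])
    assert m % TM == 0 and k % TK == 0 and n % TN == 0
    out = [[0] * n for _ in range(m)]
    for i0 in range(0, m, TM):
        for j0 in range(0, n, TN):
            # Start the accumulator tile from the bias (fused bias addition).
            acc = [[bias[j0 + j] for j in range(TN)] for _ in range(TM)]
            for k0 in range(0, k, TK):
                # One <TM, TK, TN> tile multiply-accumulate step.
                for i in range(TM):
                    for j in range(TN):
                        acc[i][j] += sum(
                            x[i0 + i][k0 + kk] * w[k0 + kk][j0 + j]
                            for kk in range(TK)
                        )
            # Apply ReLU once, when the finished tile is written back.
            for i in range(TM):
                for j in range(TN):
                    out[i0 + i][j0 + j] = max(acc[i][j], 0)
    return out


def reference_linear_relu(x, w, bias):
    """Unblocked reference used to check the tiled version."""
    n = len(w[0])
    return [
        [max(sum(a * w[kk][j] for kk, a in enumerate(row)) + bias[j], 0)
         for j in range(n)]
        for row in x
    ]
```

Because all arithmetic is on integers, the tiled and unblocked versions agree exactly, which is the kind of property a bit-exactness check between a quantized model and its tiled implementation relies on.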