Seamless acceleration of Fortran intrinsics via AMD AI engines

Nick Brown, Gabriel Rodríguez Canal

TL;DR

The paper addresses the challenge of delivering energy-efficient HPC performance by offloading Fortran intrinsics to AMD's AI Engines (AIEs) integrated in Ryzen AI CPUs. It introduces a compiler workflow based on Flang, MLIR, and xDSL that lowers Fortran linear algebra intrinsics to AIE kernels and uses an xrt_wrapper to manage CPU-NPU interaction. Key contributions include a seamless offload path for Fortran intrinsics that requires no code changes, a library of templated AIE IRs specialized per invocation, and experimental evidence across reductions, transpositions, and matmul. Although initial setup overhead and certain data-type limitations temper gains for some kernels, the work demonstrates the practical potential of AIEs for HPC, particularly for large or highly vectorizable operations.

Abstract

A major challenge facing the HPC community is how to continue delivering the performance demanded by scientific programmers whilst meeting an increased emphasis on sustainable operations. Specialised architectures, such as FPGAs and AMD's AI Engines (AIEs), have been demonstrated to provide significant energy efficiency advantages. However, programming these architectures effectively requires significant expertise and investment of time, which is a major blocker. Fortran is the lingua franca of scientific computing, and in this paper we explore automatically accelerating Fortran intrinsics via the AIEs in AMD's Ryzen AI CPU. Leveraging the open source Flang compiler and MLIR ecosystem, we describe an approach that lowers the MLIR linear algebra dialect to AMD's AIE dialects, and demonstrate that for suitable workloads the AIEs can provide significant performance advantages over the CPU without any code modifications required by the programmer.

Paper Structure

This paper contains 8 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Illustration of the MLIR-based Fortran compilation flow developed by brown2024fully, which builds upon Flang to generate LLVM-IR.
  • Figure 2: Illustration of our overarching compiler approach that offloads selected linear algebra operations to the AIE array.