Towards a high-performance AI compiler with upstream MLIR

Renato Golin; Lorenzo Chelini; Adam Siemieniuk; Kavitha Madhu; Niranjan Hasabnis; Hans Pabst; Evangelos Georganas; Alexander Heinecke

TL;DR

The paper tackles the challenge of delivering ninja-level performance for high-level linear algebra by marrying an open-source MLIR-based compiler flow with upstream Linalg-on-Tensors IR and a downstream XSMM/libxsmm path. It introduces tensor packing/unpacking with layout propagation, a tile-and-fuse optimization pipeline, and a lowering path to the libxsmm-based micro-kernel library, enabling automatic, hardware-conscious code generation without hand-tuned pragmas. The proof-of-concept shows input IR from TensorFlow and PyTorch can be lowered into optimized libxsmm calls, achieving over $90\%$ of ninja-written performance on diverse CPUs. The work argues for upstreaming high-level optimization passes to reduce maintenance and outlines a roadmap toward broader workloads, GPUs, and a learned cost-model to drive tiling and kernel selection.

Abstract

This work proposes a compilation flow using open-source compiler passes to build a framework to achieve ninja performance from a generic linear algebra high-level abstraction. We demonstrate this flow with a proof-of-concept MLIR project that uses input IR in Linalg-on-Tensor from TensorFlow and PyTorch, performs cache-level optimizations and lowering to micro-kernels for efficient vectorization, achieving over 90% of the performance of ninja-written equivalent programs. The contributions of this work include: (1) Packing primitives on the tensor dialect and passes for cache-aware distribution of tensors (single and multi-core) and type-aware instructions (VNNI, BFDOT, BFMMLA), including propagation of shapes across the entire function; (2) A linear algebra pipeline, including tile, fuse and bufferization strategies to get model-level IR into hardware friendly tile calls; (3) A mechanism for micro-kernel lowering to an open source library that supports various CPUs.
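The packing idea in contribution (1) can be illustrated with a small, self-contained sketch. This is not the paper's MLIR implementation: the function names (`pack`, `unpack`, `block_transpose`, `tile_gemm`, `packed_matmul`) and the tile sizes are hypothetical, `tile_gemm` merely stands in for a micro-kernel call, and shapes are assumed to divide evenly by the tile sizes. It shows a 2D matrix rearranged into a tiled 4D layout, the tiles of one operand transposed whole, and a GEMM computed tile by tile.

```python
def pack(matrix, tile_rows, tile_cols):
    """Rearrange a 2D list into a 4D [block_row][block_col][r][c] layout."""
    rows, cols = len(matrix), len(matrix[0])
    if rows % tile_rows or cols % tile_cols:
        raise ValueError("this sketch assumes shapes divisible by the tile sizes")
    return [
        [
            [
                [matrix[br * tile_rows + r][bc * tile_cols + c]
                 for c in range(tile_cols)]
                for r in range(tile_rows)
            ]
            for bc in range(cols // tile_cols)
        ]
        for br in range(rows // tile_rows)
    ]


def unpack(packed):
    """Inverse of pack: flatten the 4D tiled layout back into a 2D matrix."""
    tile_rows = len(packed[0][0])
    tile_cols = len(packed[0][0][0])
    rows = len(packed) * tile_rows
    cols = len(packed[0]) * tile_cols
    return [
        [packed[i // tile_rows][j // tile_cols][i % tile_rows][j % tile_cols]
         for j in range(cols)]
        for i in range(rows)
    ]


def block_transpose(packed):
    """Swap the two outer (block) dimensions; each tile is moved whole."""
    return [
        [packed[br][bc] for br in range(len(packed))]
        for bc in range(len(packed[0]))
    ]


def tile_gemm(a_tile, b_tile, c_tile):
    """Stand-in for a micro-kernel call: c_tile += a_tile @ b_tile."""
    for i in range(len(a_tile)):
        for j in range(len(b_tile[0])):
            acc = c_tile[i][j]
            for k in range(len(b_tile)):
                acc += a_tile[i][k] * b_tile[k][j]
            c_tile[i][j] = acc


def packed_matmul(a, b, tm, tn, tk):
    """Compute a @ b on packed operands, one tile-sized GEMM at a time."""
    pa = pack(a, tm, tk)                     # [M/tm][K/tk][tm][tk]
    pb = block_transpose(pack(b, tk, tn))    # [N/tn][K/tk][tk][tn]
    pc = [
        [[[0] * tn for _ in range(tm)] for _ in range(len(pb))]
        for _ in range(len(pa))
    ]
    for bm in range(len(pa)):
        for bn in range(len(pb)):
            for bk in range(len(pa[0])):
                tile_gemm(pa[bm][bk], pb[bn][bk], pc[bm][bn])
    return unpack(pc)
```

In this layout each tile is stored contiguously, so the innermost loop nest touches one small block per operand. That is the property a micro-kernel relies on, and it is also why leaving intermediate results packed between layers avoids repeated pack and unpack costs.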

Paper Structure (19 sections, 8 figures)

Introduction
MLIR and the Linalg Dialect
Compilation Strategy
Why Linalg on Tensors?
Compiler Passes
Pack, Unpack, and Propagation
Tile and Fuse
Lowering to Hardware Dialects
The XSMM dialect
Parallelism
2D parallelism
AMX tile configuration hoisting
Results
Single-thread performance
Single-thread packing costs
...and 4 more sections

Figures (8)

Figure 1: A simplified view of the proposed compiler strategy. External components are shown in gray, upstream compiler technology in green, and potentially downstream parts in blue. Boundaries depend on which ingress format and which hardware abstractions are used. XSMM is our choice of CPU library, and OpenCL is a potential choice for GPU libraries. The proposal is equally valid with dialects and further compilers (e.g., LLVM) down the line.
Figure 2: Packed layout for GEMM operation. After tiling (smaller square), the tiles are transposed whole ("block-transpose"). For optimal multi-threaded locality we also group different blocks (single-colored areas) for each thread.
Figure 3: Pack propagation through a multi-layer model. Packed GEMMs propagate their layout to the following element-wise operations, exposing canonicalization in between layers to elide all intermediate packs and unpacks, leaving only the initial packs and final unpack.
Figure 4: Each layer is executed across all cores (data parallelism). For data locality, we block each thread within a single block (of tiles) within the original matrix (see Figure 2). To amortize the cost of using the matrix extension on Sapphire Rapids, we hoist the setup and reset calls within each thread.
Figure 5: Single-thread results for all CPUs. The compiler's performance is on par with all hand-written results except on c7i (SPR), where compute density may be affecting memory bandwidth. Note that the scale on the first three plots is up to 250 GFLOPS, while on the last two it is up to 2 TFLOPS.
...and 3 more figures
