Towards a high-performance AI compiler with upstream MLIR
Renato Golin, Lorenzo Chelini, Adam Siemieniuk, Kavitha Madhu, Niranjan Hasabnis, Hans Pabst, Evangelos Georganas, Alexander Heinecke
TL;DR
The paper tackles the challenge of delivering ninja-level performance for high-level linear algebra by combining an open-source MLIR-based compiler flow with upstream Linalg-on-Tensors IR and a downstream XSMM/libxsmm path. It introduces tensor packing/unpacking with layout propagation, a tile-and-fuse optimization pipeline, and a lowering path to the libxsmm-based micro-kernel library, enabling automatic, hardware-aware code generation without hand-tuned pragmas. The proof of concept shows that input IR from TensorFlow and PyTorch can be lowered into optimized libxsmm calls, achieving over 90% of ninja-written performance on diverse CPUs. The work argues for upstreaming high-level optimization passes to reduce maintenance and outlines a roadmap toward broader workloads, GPUs, and a learned cost model to drive tiling and kernel selection.
Abstract
This work proposes a compilation flow that uses open-source compiler passes to build a framework achieving ninja performance from a generic, high-level linear algebra abstraction. We demonstrate this flow with a proof-of-concept MLIR project that takes input IR in Linalg-on-Tensor from TensorFlow and PyTorch, performs cache-level optimizations, and lowers to micro-kernels for efficient vectorization, achieving over 90% of the performance of ninja-written equivalent programs. The contributions of this work include: (1) Packing primitives on the tensor dialect and passes for cache-aware distribution of tensors (single and multi-core) and type-aware instructions (VNNI, BFDOT, BFMMLA), including propagation of shapes across the entire function; (2) A linear algebra pipeline, including tiling, fusion, and bufferization strategies, to turn model-level IR into hardware-friendly tile calls; (3) A mechanism for micro-kernel lowering to an open-source library that supports various CPUs.
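To make contributions (1) and (3) concrete, the following is a minimal pure-Python sketch of the general idea: a matrix is packed into a blocked layout so that each tile is contiguous, and a matrix multiplication is then expressed as a loop nest of small tile-level kernel calls. It is an illustration under stated assumptions, not the paper's MLIR passes or the libxsmm API. The names `pack`, `unpack`, `micro_kernel`, and `packed_matmul`, the block sizes, and the requirement that shapes divide evenly (no padding) are all hypothetical choices made for this example.

```python
# Illustrative sketch only: hypothetical helpers, not MLIR passes or libxsmm calls.

def pack(matrix, bm, bn):
    """Pack a row-major (rows x cols) matrix into blocks[rb][cb][i][j].

    Each bm x bn tile becomes its own contiguous unit, mimicking a
    cache-friendly blocked layout. Assumes shapes divide evenly (no padding).
    """
    rows, cols = len(matrix), len(matrix[0])
    assert rows % bm == 0 and cols % bn == 0, "sketch assumes divisible shapes"
    return [[[[matrix[rb * bm + i][cb * bn + j] for j in range(bn)]
              for i in range(bm)]
             for cb in range(cols // bn)]
            for rb in range(rows // bm)]

def unpack(blocks):
    """Inverse of pack: restore the row-major matrix from the blocked layout."""
    bm, bn = len(blocks[0][0]), len(blocks[0][0][0])
    rows, cols = len(blocks) * bm, len(blocks[0]) * bn
    return [[blocks[r // bm][c // bn][r % bm][c % bn] for c in range(cols)]
            for r in range(rows)]

def micro_kernel(a_tile, b_tile, c_tile):
    """Hypothetical stand-in for a tile-level kernel: c_tile += a_tile @ b_tile."""
    for i in range(len(a_tile)):
        for j in range(len(b_tile[0])):
            acc = c_tile[i][j]
            for k in range(len(b_tile)):
                acc += a_tile[i][k] * b_tile[k][j]
            c_tile[i][j] = acc

def packed_matmul(a, b, bm, bn, bk):
    """Compute a @ b by packing both operands and looping over tiles."""
    a_p = pack(a, bm, bk)  # layout [M/bm][K/bk][bm][bk]
    b_p = pack(b, bk, bn)  # layout [K/bk][N/bn][bk][bn]
    c_p = [[[[0] * bn for _ in range(bm)] for _ in range(len(b_p[0]))]
           for _ in range(len(a_p))]
    for mb in range(len(a_p)):            # tiles along M
        for nb in range(len(b_p[0])):     # tiles along N
            for kb in range(len(b_p)):    # reduction over tiles along K
                micro_kernel(a_p[mb][kb], b_p[kb][nb], c_p[mb][nb])
    return unpack(c_p)

def naive_matmul(a, b):
    """Reference triple loop used to check the packed version."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

if __name__ == "__main__":
    a = [[i * 6 + j for j in range(6)] for i in range(4)]        # 4 x 6
    b = [[(i + 2 * j) % 5 for j in range(4)] for i in range(6)]  # 6 x 4
    assert packed_matmul(a, b, bm=2, bn=2, bk=3) == naive_matmul(a, b)
```

In a real flow the innermost call would be replaced by an optimized, vectorized micro-kernel, and the block sizes would be chosen to suit the target's cache hierarchy and instruction set. The sketch only shows how packing turns one large operation into a regular sequence of tile-sized calls.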
