Library Liberation: Competitive Performance Matmul Through Compiler-composed Nanokernels

Arun Thangamani, Md Asghar Ahmad Shahid, Adam Siemieniuk, Rolf Morel, Renato Golin, Alexander Heinecke

TL;DR

The paper tackles the challenge of achieving near-peak performance for contraction-based ML workloads without depending on hand-tuned kernels or external libraries. It introduces a compiler-driven path that auto-generates target-specific nanokernels from MLIR, incorporating BRGEMM, VNNI packing, and ISA-specific primitives (e.g., AMX, AVX2/AVX512). The approach demonstrates production-grade performance, rivaling libxsmm across FP32 and BF16 paths on recent Intel CPUs, while enabling rapid adaptation to new data layouts and hardware features through MLIR dialects and dedicated lowering passes. This work promises a scalable, library-free trajectory for high-performance GEMM-based workloads, with practical impact on simplifying deployment and improving portability across evolving architectures.

Abstract

The rapidly evolving landscape of AI and machine learning workloads has widened the gap between high-level domain operations and efficient hardware utilization. Achieving near-peak performance still demands deep hardware expertise: experts either handcraft target-specific kernels (e.g., DeepSeek) or rely on specialized libraries (e.g., CUTLASS), both of which add complexity and limit scalability for most ML practitioners. This paper introduces a compilation scheme that automatically generates scalable, high-performance microkernels by leveraging MLIR dialects to bridge domain-level operations and processor capabilities. Our approach removes dependence on low-level libraries by enabling the compiler to auto-generate near-optimal code directly. At its core is a mechanism for composing nanokernels from low-level IR constructs with near-optimal register utilization, forming efficient microkernels tailored to each target. We implement this technique in an MLIR-based compiler supporting both vector and tile-based CPU instructions. Experiments show that the generated nanokernels are of production quality and competitive with state-of-the-art microkernel libraries.
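The TL;DR names two building blocks, VNNI packing and BRGEMM (batch-reduce GEMM), without defining them. The following is a minimal reference sketch of their semantics in plain Python, not the paper's compiler output. The function names `vnni_pack` and `brgemm` are hypothetical, and the VNNI factor of 2 is an assumption that matches the usual pairing of 16-bit elements such as BF16.

```python
def vnni_pack(b, factor=2):
    """Pack a row-major K x N matrix into a (K/factor) x N x factor layout.

    Assumption: factor=2, i.e. two consecutive K-elements of the same
    column are stored next to each other, as is typical for BF16.
    """
    k, n = len(b), len(b[0])
    if k % factor != 0:
        raise ValueError("K must be a multiple of the VNNI factor")
    return [
        [[b[kb * factor + f][col] for f in range(factor)] for col in range(n)]
        for kb in range(k // factor)
    ]


def brgemm(a_batch, b_batch, c):
    """Batch-reduce GEMM: accumulate sum_i (A_i @ B_i) into C in place."""
    for a, b in zip(a_batch, b_batch):
        for i in range(len(a)):
            for j in range(len(b[0])):
                acc = c[i][j]
                for p in range(len(b)):
                    acc += a[i][p] * b[p][j]
                c[i][j] = acc
    return c


# A 4x4 row-major matrix with entries 0..15.
b = [[4 * r + col for col in range(4)] for r in range(4)]
packed = vnni_pack(b)

# Two 2x2 blocks reduced into a single 2x2 accumulator.
a_batch = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
b_batch = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
c = brgemm(a_batch, b_batch, [[0, 0], [0, 0]])
```

In the packed layout, each inner pair holds two vertically adjacent elements of one column, which is the shape that pairwise dot-product instructions consume. The BRGEMM loop shows why a single accumulator tile can stay in registers across the whole batch: only `c` is written, once per output element per block.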

Paper Structure

This paper contains 40 sections, 7 figures, 3 tables, and 3 algorithms.

Figures (7)

  • Figure 1: Packing a 4$\times$4 row-major order matrix to VNNI
  • Figure 2: BRGEMM register tiling described in tpp-tiling for (a) 32 vector registers, (b) 16 vector registers, (c) 8 2D register tiles, and (d) 16 vector registers with the loads of A and B swapped.
  • Figure 3: AVX2 BF16 Packed Operations
  • Figure 4: Converting Flat to VNNI packed layout using vpunpcklwd/hwd operations.
  • Figure 5: % GFLOPS performance of MLP FP32 AVX nanokernels vs libxsmm on EMR, SRF, and ARL machines.
  • ...and 2 more figures