Table of Contents
Fetching ...

PolyBlocks: A Compiler Infrastructure for AI Chips and Programming Frameworks

Uday Bondhugula, Akshay Baviskar, Navdeep Katel, Vimal Patel, Anoop JS, Arnab Dutta

TL;DR

Experimental results from evaluating PolyBlocks-powered just-in-time compilation for PyTorch and JAX targeting NVIDIA GPUs show that it is able to match or outperform Torch Inductor and XLA in several cases, although the latter rely on a combination of vendor libraries and code generation.

Abstract

We present the design and implementation of PolyBlocks, a modular and reusable MLIR-based compiler infrastructure for AI programming frameworks and AI chips. PolyBlocks is based on pass pipelines that compose transformations on loop nests and SSA, primarily relying on lightweight affine access analysis; the transformations are stitched together in specialized ways to realize high-performance code automatically by the use of analytical cost models and heuristics. The optimizations in these passes include multi-level tiling, fusion, on-chip scratchpad usage, mapping matmuls and convolutions to matrix units, fusing the attention layer, and several other transformations for parallelism and locality. They have been developed in a way that makes it easy to build PolyBlocks-based compilers to target new chips, reusing much of the infrastructure. PolyBlocks' design and architecture enable fully automatic code generation from high-level frameworks to low-level target-specific intrinsics. Experimental results from evaluating PolyBlocks-powered just-in-time compilation for PyTorch and JAX targeting NVIDIA GPUs show that it is able to match or outperform Torch Inductor and XLA in several cases, although the latter rely on a combination of vendor libraries and code generation. For individual operators like matmuls and convolutions, PolyBlocks-generated code is competitive with the best vendor-tuned libraries or hand-written kernels.

PolyBlocks: A Compiler Infrastructure for AI Chips and Programming Frameworks

TL;DR

Experimental results from evaluating PolyBlocks-powered just-in-time compilation for PyTorch and JAX targeting NVIDIA GPUs show that it is able to match or outperform Torch Inductor and XLA in several cases, although the latter rely on a combination of vendor libraries and code generation.

Abstract

We present the design and implementation of PolyBlocks, a modular and reusable MLIR-based compiler infrastructure for AI programming frameworks and AI chips. PolyBlocks is based on pass pipelines that compose transformations on loop nests and SSA, primarily relying on lightweight affine access analysis; the transformations are stitched together in specialized ways to realize high-performance code automatically by the use of analytical cost models and heuristics. The optimizations in these passes include multi-level tiling, fusion, on-chip scratchpad usage, mapping matmuls and convolutions to matrix units, fusing the attention layer, and several other transformations for parallelism and locality. They have been developed in a way that makes it easy to build PolyBlocks-based compilers to target new chips, reusing much of the infrastructure. PolyBlocks' design and architecture enable fully automatic code generation from high-level frameworks to low-level target-specific intrinsics. Experimental results from evaluating PolyBlocks-powered just-in-time compilation for PyTorch and JAX targeting NVIDIA GPUs show that it is able to match or outperform Torch Inductor and XLA in several cases, although the latter rely on a combination of vendor libraries and code generation. For individual operators like matmuls and convolutions, PolyBlocks-generated code is competitive with the best vendor-tuned libraries or hand-written kernels.
Paper Structure (29 sections, 19 figures, 4 tables)

This paper contains 29 sections, 19 figures, 4 tables.

Figures (19)

  • Figure 1: An indicative figure showing the trade-offs between high, mid, and low-level programming frameworks while delivering comparable performance. Matmul with PyTorch, Triton, and CUDA, all of which deliver close to peak performance. CUDA courtesy: Lei Mao cudaoptgemmleimao.
  • Figure 2: PyTorch functions JIT compiled with PolyBlocks. Incremental programmability.
  • Figure 3: PolyBlocks compiler engine powering multiple programming frameworks and hardware. MLIR dialect lowering paths and how new targets are supported (XYZ in this figure).
  • Figure 4: Overall components and high-level architecture of a compiler built from PolyBlocks.
  • Figure 5: A specialized packing into an on-chip memory buffer: note contiguity in two-sized intervals along %j.
  • ...and 14 more figures