ACT: Automatically Generating Compiler Backends from Tensor Accelerator ISA Descriptions

Devansh Jain, Akash Pardeshi, Marco Frigo, Krut Patel, Kaustubh Khulbe, Jai Arora, Charith Mendis

TL;DR

ACT introduces a first-of-its-kind compiler backend generator that produces sound and complete accelerator backends from tensor accelerator ISA descriptions. It formalizes ISA descriptions, uses equality saturation for parameterized instruction selection, and applies constraint programming for memory allocation, with inter-phase fallbacks to guarantee completeness. The framework is instantiated for multiple accelerators (e.g., Gemmini, Intel AMX, and a QKV ADL-based design) and achieves performance on par with or better than hand-optimized kernels, while maintaining low compilation times. By automating backend generation from ISA descriptions, ACT enables rapid exploration and deployment of new tensor accelerators, reducing engineering effort and accelerating software-hardware co-design.

Abstract

Tensor compilers play a key role in enabling high-performance implementations of deep learning workloads. These compilers rely on existing CPU and GPU code generation backends to generate device-specific code. Recently, many tensor accelerators (neural processing units) have been proposed to further accelerate these workloads. Unlike commodity hardware, however, most of the proposed tensor accelerators do not have compiler backends with code generation support. Moreover, the accelerator designs are subject to fast iteration cycles, making it difficult to manually develop compiler backends as is done for commodity hardware platforms. Therefore, to increase adoption and enable faster software development cycles for novel tensor accelerator designs, we need to make the compiler backend construction process more agile. To address this gap, we introduce ACT, a compiler backend generator that automatically generates compiler backends for tensor accelerators, given just the instruction set architecture (ISA) descriptions. We first formally specify the compiler backend generation problem, introducing a novel specification for describing tensor accelerator ISAs. Next, we design ACT such that it supports user-programmable memories and complex parameterized instructions that are prevalent in tensor accelerators. ACT uses a novel parameterized equality saturation-based instruction selection phase and a constraint programming-based memory allocation phase. We prove that compiler backends generated by ACT are sound and complete. Finally, we generate compiler backends for three accelerator platforms from industry and academia, and show that they match or outperform code written using hand-optimized kernel libraries while maintaining low compilation overheads.

Paper Structure

This paper contains 79 sections, 6 theorems, 2 equations, 56 figures, 6 tables.

Key Result

Theorem 1

$\forall\, \mathsf{ISA}^{H},\ \textsc{Act}(\mathsf{ISA}^{H})$ is sound and complete, in the sense of the paper's definitions of soundness and completeness.

Figures (56)

  • Figure 1: Typical hierarchical compilation pipeline present in tensor compilers. Our solution, Act, focuses on the accelerator-specific compiler backends (shaded orange), i.e., compiling tensor kernels to accelerator ISA.
  • Figure 2: Self-attention kernels in Transformer models like BERT use QKV computation $\mathsf{softmax}(Q \cdot K^T) \cdot V$ over 4-dimensional tensors. For illustration purposes, we assume a single-batch single-head QKV computation $G_{QKV}$ over $\mathsf{bf16}[64,64]$ tensors.
  • Figure 3: Running example. (a) High-level accelerator design for a hypothetical QKV accelerator $H_{QKV}$. (b) Brief description of its instruction set $\Theta^{H_{QKV}}$. Tensor variables are colored based on their storage unit (${\color{myred} d_0}$, ${\color{myblue} d_1}$, ${\color{mygreen} d_2}$).
  • Figure 4: Execution of concrete instruction $\mathsf{load\_rm}(\mathsf{n} = 4, \mathsf{addr_{in}} = 0, \mathsf{addr_{out}} = 2)$ as a concrete tensor computation graph.
  • Figure 4: Statistics of selected oneDNN kernels and their optimized assembly code in oneDNN library
  • ...and 51 more figures
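
As a reference point for the running example in Figure 2, the single-batch, single-head computation $\mathsf{softmax}(Q \cdot K^T) \cdot V$ can be sketched in plain Python. This is only an illustrative reference semantics, not code from the paper or output of ACT: the helper names (`matmul`, `transpose`, `softmax_rows`, `qkv_attention`) and the input values are hypothetical, the softmax is assumed to be applied row-wise, and ordinary Python floats stand in for $\mathsf{bf16}$.

```python
import math


def matmul(a, b):
    """Multiply an m-by-k matrix by a k-by-n matrix (both lists of rows)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Apply a numerically stable softmax to each row independently."""
    out = []
    for row in a:
        m = max(row)  # subtract the row maximum to avoid overflow in exp
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def qkv_attention(q, k, v):
    """Compute softmax(Q . K^T) . V for 2-D inputs."""
    return matmul(softmax_rows(matmul(q, transpose(k))), v)


# Hypothetical 64x64 inputs, matching the shape used in the running example.
N = 64
Q = [[((i + j) % 7) / 7.0 for j in range(N)] for i in range(N)]
K = [[((i * j) % 5) / 5.0 for j in range(N)] for i in range(N)]
V = [[((i - j) % 3) / 3.0 for j in range(N)] for i in range(N)]
OUT = qkv_attention(Q, K, V)
```

Each output row is a convex combination of the rows of `V`, because every softmax row is non-negative and sums to one. A generated backend has to preserve this semantics when it maps the kernel onto accelerator instructions.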

Theorems & Definitions (6)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5