Mirage: A Multi-Level Superoptimizer for Tensor Programs

Mengdi Wu; Xinhao Cheng; Shengyu Liu; Chunan Shi; Jianan Ji; Kit Ao; Praveen Velliengiri; Xupeng Miao; Oded Padon; Zhihao Jia

TL;DR

Mirage tackles the challenge of optimizing tensor programs on GPUs by introducing a multi-level, uniform $\mu$Graph representation that spans kernel, block, and thread levels. It couples algebraic and schedule transformations with the discovery of new custom kernels, enabled by a pruning mechanism based on abstract expressions and a probabilistic, field-based equivalence verifier that provides strong theoretical guarantees for Lax programs. The system is implemented with an ILP-based layout optimizer, depth-based scheduling, and memory planning, and is demonstrated to outperform existing approaches by up to $3.3\times$ on six DNN benchmarks across A100 and H100 GPUs. This approach reduces manual kernel engineering while delivering substantial practical speedups, accelerating end-to-end DNN inference and enabling more aggressive automatic optimization of tensor programs.

Abstract

We introduce Mirage, the first multi-level superoptimizer for tensor programs. A key idea in Mirage is $\mu$Graphs, a uniform representation of tensor programs at the kernel, thread block, and thread levels of the GPU compute hierarchy. $\mu$Graphs enable Mirage to discover novel optimizations that combine algebraic transformations, schedule transformations, and generation of new custom kernels. To navigate the large search space, Mirage introduces a pruning technique based on abstraction that significantly reduces the search space and provides a certain optimality guarantee. To ensure that the optimized $\mu$Graph is equivalent to the input program, Mirage introduces a probabilistic equivalence verification procedure with strong theoretical guarantees. Our evaluation shows that Mirage outperforms existing approaches by up to $3.3\times$ even for DNNs that are widely used and heavily optimized. Mirage is publicly available at https://github.com/mirage-project/mirage.
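The probabilistic equivalence check mentioned in the abstract can be illustrated with a minimal sketch in the spirit of polynomial identity testing: evaluate two candidate programs on random inputs drawn from a finite field and compare the results. Everything below is an assumption made for illustration, not Mirage's actual verifier: the prime `P`, the helper names (`rand_matrix`, `matmul`, `add`, `probably_equivalent`), and the restriction to addition and matrix multiplication are all hypothetical.

```python
import random

# Assumed prime modulus (2^31 - 1); the fields Mirage actually uses are not stated here.
P = 2_147_483_647


def rand_matrix(rows, cols, rng):
    """Return a rows x cols matrix with entries drawn uniformly from Z_P."""
    return [[rng.randrange(P) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Matrix product over Z_P."""
    return [[sum(x * y for x, y in zip(row, col)) % P for col in zip(*b)]
            for row in a]


def add(a, b):
    """Elementwise sum over Z_P."""
    return [[(x + y) % P for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def probably_equivalent(f, g, shapes, trials=8, seed=0):
    """Compare f and g on random finite-field inputs.

    A single mismatch proves the programs differ. Agreement on every
    trial only makes equivalence highly likely, not certain.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        inputs = [rand_matrix(r, c, rng) for r, c in shapes]
        if f(*inputs) != g(*inputs):
            return False
    return True


# Candidate rewrite: A @ C + B @ C  versus  (A + B) @ C.
original = lambda a, b, c: add(matmul(a, c), matmul(b, c))
rewritten = lambda a, b, c: matmul(add(a, b), c)
shapes = [(2, 3), (2, 3), (3, 4)]
print(probably_equivalent(original, rewritten, shapes))
```

Working modulo a prime keeps every intermediate value exact, so the comparison does not depend on floating-point tolerances; a rewrite that drops a term (for example, returning only `matmul(a, c)`) is rejected on the first random trial with overwhelming probability.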

Paper Structure (49 sections, 3 theorems, 3 equations, 12 figures, 5 tables, 1 algorithm)

Introduction
Expression-guided $\mu$Graph generator.
Probabilistic equivalence verifier.
$\mu$Graph optimizer.
Evaluation results.
Multi-Level Graph Representation
GPU hierarchy.
Kernel graph.
Block graph.
Grid dimensions.
For-loop body.
Thread graph.
Tensor layout.
Comparison with prior work.
Case Study: RMSNorm
...and 34 more sections

Key Result

Theorem 1

For an input $\mu$Graph $G_0$, and a $\mu$Graph $G$ equivalent to $G_0$, if $A_\text{eq} \models E(G_0) = E(G)$ then $G$ will be generated by Algorithm 1.
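The pruning idea behind this result can be illustrated with a toy sketch. It assumes abstract expressions are nested tuples of the form `(op, arg, ...)` with string leaves, and it approximates entailment under the axioms by a much weaker test: canonicalize commutative operators, then check syntactic containment in the target expression. The operator names, the `COMMUTATIVE` set, and the functions `canon`, `subexprs`, and `may_contribute` are all hypothetical; the paper's actual axioms and entailment check are not reproduced here.

```python
# Toy abstract expressions: a leaf is a string, an operator node is a
# tuple (op, arg1, arg2, ...). All operator names are illustrative.
COMMUTATIVE = {"add", "mul"}


def canon(e):
    """Canonical form: sort the arguments of commutative operators."""
    if isinstance(e, str):
        return e
    op, *args = e
    args = [canon(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)
    return (op, *args)


def subexprs(e):
    """Yield every canonicalized subexpression of e, including e itself."""
    e = canon(e)
    yield e
    if not isinstance(e, str):
        for a in e[1:]:
            yield from subexprs(a)


def may_contribute(prefix, target):
    """Keep a partial graph only if its abstract expression occurs in the target."""
    return canon(prefix) in set(subexprs(target))


# Abstract expression of a normalization followed by a matrix product.
target = ("div",
          ("matmul", ("mul", "x", "g"), "w"),
          ("sqrt", ("sum", ("mul", "x", "x"))))

print(may_contribute(("matmul", ("mul", "g", "x"), "w"), target))  # kept
print(may_contribute(("add", "x", "w"), target))                   # pruned
```

The design point is that a search prefix whose abstract expression cannot appear anywhere in the input program's expression is discarded early, before any concrete kernel is generated for it.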

Figures (12)

Figure 1: An overview of Mirage.
Figure 2: GPU compute and memory hierarchy.
Figure 3: The first panel shows the computation graph for RMSNorm and MatMul. The second panel shows the best $\mu$Graph discovered by Mirage for computing RMSNorm and MatMul, which fuses the computation into a single kernel to reduce device memory access and kernel launch overhead and outperforms existing approaches by $1.9\times$. Numbers in brackets indicate tensor shapes, and numbers in braces show the imap, omap, or fmap for the corresponding operators.
Figure 4: Demonstrating how an input tensor is partitioned across blocks and for-loop iterations with imap and fmap.
Figure 5: An overview of the $\mu$Graph generator.
...and 7 more figures

Theorems & Definitions (6)

Definition 2.1: $\mu$Graph Validity
Theorem 1: Pruning via Abstract Expressions
proof
Definition 5.1: Lax $\mu$Graph
Theorem 2
Theorem 3
