Table of Contents
Fetching ...

Equivalence Checking of ML GPU Kernels

Kshitij Dubey, Benjamin Driscoll, Anjiang Wei, Neeraj Kayal, Rahul Sharma, Alex Aiken

TL;DR

Volta addresses the challenge of formally verifying the semantic equivalence of ML GPU kernels produced by humans, LLMs, or compilers. It introduces a PTX-level, symbolically-executed equivalence checker for structured-CTAs, proving soundness and completeness within this kernel class via a confluence property that ensures schedule-independence. The approach supports data-race and deadlock detection, and a decision procedure for equalities involving additions, multiplications, and exponentials, including a proof-of-decidability for identities of the form $\sum_i p_i(\mathbf{x}) e^{h_i(\mathbf{x})} = 0$. Empirically, Volta verifies reductions, MatMul, convolutions, and attention across human-, LLM-, and compiler-generated kernels, completing within minutes and uncovering correctness bugs and data-race issues, thereby enabling reliable optimization of ML workloads on GPUs with tensor cores.

Abstract

With the rapid progress of deep learning and large language models (LLMs), companies now spend enormous sums executing GPU kernels. These kernels have, therefore, become prime targets for aggressive optimization. Recent efforts increasingly leverage LLMs to generate GPU kernels, but make no formal guarantees about the generated kernels. We present the first equivalence checker for GPU kernels and use it to formally verify the correctness of machine learning (ML) kernels optimized by hand, by LLMs, and by compilers. We show that our equivalence checker is sound and, for a well-defined class of GPU kernels which includes the programs of interest, complete. Our implementation, VOLTA, can verify ML computations such as convolutions, matrix multiplications, and various attention mechanisms.

Equivalence Checking of ML GPU Kernels

TL;DR

Volta addresses the challenge of formally verifying the semantic equivalence of ML GPU kernels produced by humans, LLMs, or compilers. It introduces a PTX-level, symbolically-executed equivalence checker for structured-CTAs, proving soundness and completeness within this kernel class via a confluence property that ensures schedule-independence. The approach supports data-race and deadlock detection, and a decision procedure for equalities involving additions, multiplications, and exponentials, including a proof-of-decidability for identities of the form . Empirically, Volta verifies reductions, MatMul, convolutions, and attention across human-, LLM-, and compiler-generated kernels, completing within minutes and uncovering correctness bugs and data-race issues, thereby enabling reliable optimization of ML workloads on GPUs with tensor cores.

Abstract

With the rapid progress of deep learning and large language models (LLMs), companies now spend enormous sums executing GPU kernels. These kernels have, therefore, become prime targets for aggressive optimization. Recent efforts increasingly leverage LLMs to generate GPU kernels, but make no formal guarantees about the generated kernels. We present the first equivalence checker for GPU kernels and use it to formally verify the correctness of machine learning (ML) kernels optimized by hand, by LLMs, and by compilers. We show that our equivalence checker is sound and, for a well-defined class of GPU kernels which includes the programs of interest, complete. Our implementation, VOLTA, can verify ML computations such as convolutions, matrix multiplications, and various attention mechanisms.

Paper Structure

This paper contains 21 sections, 12 theorems, 11 equations, 3 figures, 6 tables.

Key Result

Lemma 1

If $(\mathcal{X}, \mathcal{G}, \mathcal{R}, P) \hookrightarrow^* (\mathcal{X}_1, \mathcal{G}_1, \mathcal{R}_1, P_1)$ and $(\mathcal{X}, \mathcal{G}, \mathcal{R}, P) \hookrightarrow^* (\mathcal{X}_2, \mathcal{G}_2, \mathcal{R}_2, P_2)$ then there exists $\mathcal{X}', \mathcal{G}', \mathcal{R}', P'$

Figures (3)

  • Figure 1: Naı̈ve softmax (left) requires materializing all exponentiations in shared memory; online softmax (right, used in FlashAttention) maintains a running max and normalization factor, enabling numerically stable and memory-efficient streaming.
  • Figure 2: Naïve softmax execution for N=4 threads. Each column shows the code executed by a single thread indexed by tid.
  • Figure 3: Syntax of $\mathcal{L}$, the language of structured-CTAs.

Theorems & Definitions (12)

  • Lemma 1: Confluence
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Theorem 1: Data Race Freedom
  • Theorem 2: True Positives Property
  • Lemma 5
  • Lemma 6
  • Theorem 3: Soundness of Equivalence Checking
  • Theorem 4: Completeness of Equivalence Checking
  • ...and 2 more