KernelBench: Can LLMs Write Efficient GPU Kernels?

Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini

TL;DR

KernelBench investigates whether language models can automatically generate GPU kernels that are both functionally correct and fast on real-world PyTorch workloads. It introduces a 250-task benchmark, a fast_p metric that combines correctness and speedup, and a workflow that accepts rich hardware-specific information and feedback. Across multiple models and test-time methods, frontier reasoning models show limited out-of-the-box success. Iterative refinement with execution and profiling feedback delivers the largest gains, but substantial limitations remain, especially across hardware. The work provides an open-source framework, analyzes failure modes, and outlines concrete avenues (data, prompting, and tooling) for advancing LM-driven kernel optimization, with tangible production impact on energy and cost efficiency.

Abstract

Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment, and progress on the benchmark translates directly into faster practical kernels. We introduce a new evaluation metric, fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over the baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform best out of the box but still fall short overall, matching the PyTorch baseline in fewer than 20% of cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise the speedup threshold p.
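The fast_p metric described in the abstract can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's own implementation: the `KernelResult` record, its field names, and the sample timings are hypothetical, and speedup is assumed to be baseline time divided by generated-kernel time.

```python
from dataclasses import dataclass


@dataclass
class KernelResult:
    """Hypothetical record for one generated kernel (names are assumptions)."""
    correct: bool            # passed the functional-correctness check
    baseline_time_ms: float  # runtime of the PyTorch baseline
    kernel_time_ms: float    # runtime of the generated kernel (assumed > 0)


def fast_p(results, p):
    """Fraction of kernels that are correct AND have speedup strictly greater than p."""
    if not results:
        raise ValueError("results must be non-empty")
    hits = sum(
        1
        for r in results
        if r.correct and (r.baseline_time_ms / r.kernel_time_ms) > p
    )
    return hits / len(results)


# Made-up timings, purely for illustration.
results = [
    KernelResult(correct=True, baseline_time_ms=10.0, kernel_time_ms=5.0),   # 2.0x
    KernelResult(correct=True, baseline_time_ms=10.0, kernel_time_ms=20.0),  # 0.5x
    KernelResult(correct=False, baseline_time_ms=10.0, kernel_time_ms=1.0),  # wrong output
    KernelResult(correct=True, baseline_time_ms=10.0, kernel_time_ms=10.0),  # 1.0x
]

fast_0 = fast_p(results, 0)  # correctness only
fast_1 = fast_p(results, 1)  # correct and strictly faster than the baseline
fast_2 = fast_p(results, 2)  # correct and more than 2x faster
```

With p = 0 the metric reduces to a correctness rate, since every correct kernel has a positive speedup. Raising p only removes kernels from the count, which is why the benchmark gets harder as the threshold increases.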

Paper Structure

This paper contains 55 sections, 1 equation, 13 figures, 14 tables.

Figures (13)

  • Figure 1: KernelBench evaluates LMs' ability to generate performant GPU kernels. Overview of tasks in KernelBench: KernelBench tasks LMs with generating optimized CUDA kernels for a given target PyTorch model architecture and conducts automated evaluation.
  • Figure 2: KernelBench is a challenging benchmark for current LMs. Here we present $\text{fast}_{1}$, i.e. the percentage of problems where the model-generated kernel is faster than the PyTorch Eager and torch.compile baseline (default configuration) on NVIDIA L40S.
  • Figure 3: We categorize failure modes of kernel code into execution failure and functional correctness. For the one-shot baseline, reasoning models generate fewer kernels with execution failures, but all models struggle similarly with functional correctness.
  • Figure 4: Most LM-generated kernels are slow. This figure shows the distribution of the $\text{fast}_{p}$ metric as the speedup threshold $p$ (over the PyTorch baseline) increases. $\text{fast}_{0}$ represents the fraction of kernels that are correct regardless of speed, and $\text{fast}_{1}$ represents the fraction of kernels that are correct and achieve a speedup greater than $1\times$ over PyTorch. Increasing the threshold $p$ increases the difficulty.
  • Figure 5: Repeated sampling helps discover more correct and performant kernels. As the number of repeated samples $k$ increases (up to 100), we observe that $\text{fast}_1$@k improves for both DeepSeek-V3 and Llama 3.1-70B Instruct across all 3 KernelBench levels. We also observe a larger increase in correct solutions for Level 2 kernels.
  • ...and 8 more figures
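
The $\text{fast}_1$@k quantity in the Figure 5 caption is not defined in this excerpt. One plausible reading, assumed here by analogy with pass@k-style evaluation, is the fraction of problems for which at least one of $k$ sampled kernels is correct and exceeds the speedup threshold. The sketch below follows that reading; the function name, the data layout, and the sample values are hypothetical.

```python
def fast_p_at_k(speedups_per_problem, p, k):
    """Fraction of problems solved by at least one of the first k samples.

    speedups_per_problem: one list per problem; each entry is the measured
    speedup of a sampled kernel, or None if that kernel was incorrect.
    This layout is an assumption made for illustration.
    """
    if not speedups_per_problem:
        raise ValueError("need at least one problem")
    solved = 0
    for samples in speedups_per_problem:
        # A problem counts once if any of its first k samples is correct
        # and strictly faster than the threshold p.
        if any(s is not None and s > p for s in samples[:k]):
            solved += 1
    return solved / len(speedups_per_problem)


# Made-up samples for three problems, three samples each.
problems = [
    [None, 0.8, 1.5],    # first fast-and-correct kernel appears at sample 3
    [None, None, None],  # never solved
    [1.2, None, 0.9],    # solved by the first sample
]

at_1 = fast_p_at_k(problems, p=1, k=1)
at_3 = fast_p_at_k(problems, p=1, k=3)
```

Because a larger k only adds candidate samples per problem, this quantity cannot decrease as k grows, which is consistent with the improvement the caption reports.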