Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

Robert Tjarko Lange, Qi Sun, Aaditya Prasad, Maxence Faldor, Yujin Tang, David Ha

TL;DR

The paper tackles the gap between high-level AI-assisted software engineering and low-level CUDA kernel optimization by introducing robust-kbench and an agentic pipeline that translates PyTorch code into CUDA kernels, optimizes their runtimes via evolutionary search, and verifies correctness with LLM-based soft verifiers. It demonstrates that, within a robust evaluation setting, the evolved CUDA kernels can outperform eager PyTorch implementations on practical forward and backward tasks, while also fusing operations and enhancing hardware verification efficiency. Key contributions include a robust benchmarking harness, a soft-verification workflow, and an end-to-end agentic framework with open-source benchmarks and kernels, enabling more generalizable kernel optimization under realistic testing conditions. This work advances practical GPU acceleration for ML workloads and offers a blueprint for reliable, reproducible kernel optimization in production pipelines.

Abstract

Recent advances in large language models (LLMs) demonstrate their effectiveness in scaling test-time compute for software engineering tasks. However, these approaches often focus on high-level solutions, with limited attention to optimizing low-level CUDA kernel implementations. Additionally, existing kernel generation benchmarks suffer from exploitable loopholes and insufficient diversity in testing conditions, hindering true generalization assessment. To address these limitations, we introduce robust-kbench, a new benchmark for rigorous evaluation of kernel performance and correctness across varied scenarios. Furthermore, we present a comprehensive agentic framework that automates CUDA kernel discovery, verification, and optimization. This pipeline enables frontier LLMs to translate torch code to CUDA kernels and iteratively improve their runtime within our robust evaluation setting. Our sequential workflow first translates PyTorch code into equivalent CUDA kernels. It then optimizes their runtime using a novel evolutionary meta-generation procedure tailored to the CUDA ecosystem, guided by LLM-based verifiers for correctness and efficient filtering. Evaluated on robust-kbench, our approach produces CUDA kernels outperforming torch implementations for practical applications, including forward and backward passes. It can fuse operations and deploy various runtime optimization strategies. The verifier workflow accurately classifies incorrect kernels, enhancing hardware verification efficiency.
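
The sample, verify, test, and evaluate loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `Candidate`, `evolve`, and every `toy_*` callback are hypothetical names, the "kernel" is just a string holding a tile size, and the runtime is a made-up cost function rather than a hardware measurement. In a real setup, `propose` would call an LLM, `soft_verify` would be an LLM-based verifier, and `hard_test` and `measure` would run on a GPU.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    source: str                         # stand-in for CUDA kernel source text
    runtime_ms: Optional[float] = None  # filled in once the candidate is measured


def evolve(
    seed: Candidate,
    propose: Callable[[List[Candidate], random.Random], Candidate],
    soft_verify: Callable[[Candidate], bool],
    hard_test: Callable[[Candidate], bool],
    measure: Callable[[Candidate], float],
    generations: int = 5,
    batch_size: int = 4,
    archive_size: int = 3,
    rng_seed: int = 0,
) -> Candidate:
    """Return the fastest candidate that passed both checks."""
    rng = random.Random(rng_seed)
    seed.runtime_ms = measure(seed)
    archive = [seed]  # the translated kernel initializes the search
    for _ in range(generations):
        batch = [propose(archive, rng) for _ in range(batch_size)]
        # Cheap filter first, so fewer candidates reach the expensive test.
        batch = [c for c in batch if soft_verify(c)]
        batch = [c for c in batch if hard_test(c)]
        for c in batch:
            c.runtime_ms = measure(c)
        # Keep only the fastest verified candidates as parents.
        archive = sorted(archive + batch, key=lambda c: c.runtime_ms)[:archive_size]
    return archive[0]


# --- Toy stand-ins (all hypothetical) ---------------------------------------
def _tile(c: Candidate) -> int:
    return int(c.source.split("=")[1])


def toy_propose(archive: List[Candidate], rng: random.Random) -> Candidate:
    tile = _tile(rng.choice(archive))
    return Candidate(source=f"tile={rng.choice([tile // 2, tile * 2, tile + 3])}")


def toy_soft_verify(c: Candidate) -> bool:
    return 0 < _tile(c) <= 1024            # cheap plausibility check


def toy_hard_test(c: Candidate) -> bool:
    t = _tile(c)
    return t > 0 and (t & (t - 1)) == 0    # pretend only powers of two are correct


def toy_measure(c: Candidate) -> float:
    return 1.0 + abs(_tile(c) - 32) / 32   # made-up cost, minimal at tile=32


if __name__ == "__main__":
    best = evolve(Candidate("tile=4"), toy_propose, toy_soft_verify,
                  toy_hard_test, toy_measure)
    print(best.source, best.runtime_ms)
```

Running the cheap check before the expensive one is the design point the sketch is meant to show: candidates rejected early never consume test or measurement time.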

Paper Structure

This paper contains 45 sections, 10 figures, 9 tables, 1 algorithm.

Figures (10)

  • Figure 1: High-level overview of the LLM-Driven CUDA Optimization & Core Results. Left: Functional PyTorch code is translated into a corresponding CUDA kernel, which is loaded to replace the PyTorch-eager operation. Middle: We use the translated kernel to initialize a runtime optimization process, which samples, verifies, tests, and evaluates a batch of kernels in parallel. Throughout, we use a series of language model-based verifiers to ensure correctness and efficient filtering of candidate kernels. Right: We demonstrate that our approach can accurately identify incorrect kernels (top) and discover high-performing kernels (bottom) on the proposed robust-kbench. Runtime improvements are harder to achieve for backward than for forward kernel computations.
  • Figure 2: KernelBench Tasks. Left: Our proposed translation approach successfully translates 95% of all level 1 and level 2 KernelBench tasks. Incorporating LLM summarization of error messages outperforms simple parallel sampling. Middle: Our proposed agentic optimization framework significantly outperforms the Kevin-32B model, both evaluated on the full 200 tasks. After excluding contaminated tasks, the aggregated speedup drops significantly. Right: Our evolutionary optimization approach displays test-time scaling behavior, discovering better speedups with more tries.
  • Figure 3: Verifier Prompt Tuning Pipeline & Results. Left: Overview of the LLM-based verifier prompt tuning workflow, where a dataset of kernel proposals is used to iteratively improve the LLM-based verifier's ability to detect errors. Right, Top: Accuracy results across generations for specialized verifiers targeting different types of CUDA errors: compilation, memory, and numerics. Right, Bottom: The tuned prompts generalize to different downstream verifier models.
  • Figure 4: Forward and Backward Pass Speedups. Speedup of LLM-optimized CUDA kernels, with and without verifier, and input shape information on twelve tasks. Forward pass achieves up to 2.5x speedup; backward pass yields smaller but consistent gains. The verifier improves stability and successful kernel evaluation (yellow). Adding input shape information to the system prompt can improve performance (green). Improvements scale with the number of kernel proposals.
  • Figure 5: Generalization of discovered kernels to unseen input shapes. We evaluate the optimized CUDA kernels on input shapes not seen during optimization. For LayerNorm and MNIST Linear-ReLU tasks, the kernels show signs of overfitting to the training configuration, with performance degrading on unseen shapes. For the ResNet block task we observe positive generalization, with the optimized kernels maintaining their performance benefits across different input dimensions.
  • ...and 5 more figures
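
The shape-generalization check described in the Figure 5 caption can be illustrated with a small sketch. It assumes runtimes have already been measured per input shape; the function names (`speedups`, `generalization_report`) and all numbers below are made up for illustration and are not results from the paper. Speedup is taken as baseline runtime divided by kernel runtime, so values below 1.0 mean the kernel is slower than the baseline on that shape.

```python
from statistics import median
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def speedups(baseline_ms: Dict[Shape, float],
             kernel_ms: Dict[Shape, float]) -> Dict[Shape, float]:
    """Per-shape speedup of the optimized kernel over the baseline."""
    return {shape: baseline_ms[shape] / kernel_ms[shape] for shape in baseline_ms}


def generalization_report(baseline_ms: Dict[Shape, float],
                          kernel_ms: Dict[Shape, float],
                          tuned_shape: Shape) -> dict:
    """Compare the speedup on the tuned shape with speedups on unseen shapes."""
    s = speedups(baseline_ms, kernel_ms)
    unseen = {shape: v for shape, v in s.items() if shape != tuned_shape}
    return {
        "tuned": s[tuned_shape],
        "unseen_median": median(unseen.values()),
        # Unseen shapes where the kernel is slower than the baseline.
        "regressions": sorted(shape for shape, v in unseen.items() if v < 1.0),
    }


if __name__ == "__main__":
    # Hypothetical runtimes in milliseconds, keyed by input shape.
    baseline = {(64, 512): 1.0, (128, 512): 2.0, (64, 1024): 2.0}
    kernel = {(64, 512): 0.5, (128, 512): 2.5, (64, 1024): 1.6}
    print(generalization_report(baseline, kernel, tuned_shape=(64, 512)))
```

A large gap between the tuned-shape speedup and the unseen median, or any entry in `regressions`, would be one sign of the overfitting the caption mentions.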