AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Jaber Jaber; Osama Jaber

Abstract

Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge-case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max-autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, while beating torch.compile by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel-optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available at https://github.com/RightNow-AI/autokernel.
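The Amdahl's law ranking mentioned in the abstract can be illustrated with a minimal sketch. The function names, the profile timings, and the uniform assumed per-kernel speedup below are all hypothetical and are not taken from the AutoKernel code base; the sketch only shows why a kernel that dominates runtime ranks above a faster-to-improve but minor one.

```python
from typing import Dict, List, Tuple


def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when `fraction` of total runtime gets `kernel_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def rank_bottlenecks(
    profile_ms: Dict[str, float], assumed_speedup: float = 2.0
) -> List[Tuple[str, float]]:
    """Rank kernels by the end-to-end speedup each would yield (hypothetical helper).

    `profile_ms` maps a kernel name to its measured time; `assumed_speedup`
    is an illustrative guess applied uniformly to every kernel.
    """
    total = sum(profile_ms.values())
    scored = [
        (name, amdahl_speedup(t / total, assumed_speedup))
        for name, t in profile_ms.items()
    ]
    # Highest projected end-to-end speedup first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up timings, for illustration only.
example_profile = {"matmul": 60.0, "softmax": 30.0, "rmsnorm": 10.0}
ranking = rank_bottlenecks(example_profile)
```

Under these made-up timings, the kernel holding 60% of runtime ranks first, since even a large speedup on a 10% kernel is bounded by the remaining 90%.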

Paper Structure (45 sections, 1 equation, 2 figures, 5 tables, 1 algorithm)

Introduction
Can LLMs write GPU kernels?
Our approach.
Key insight: optimize what matters.
Contributions.
Related Work
GPU Kernel Languages and Compilers
Optimized Kernel Libraries
LLM-based Kernel Generation
Autonomous Research Agents
System Design
Model Profiling (Phase A)
Kernel Extraction
The Agent Optimization Loop (Phase B)
Timing.
...and 30 more sections

Figures (2)

Figure 1: AutoKernel architecture. Phase A profiles the model and extracts bottleneck kernels. Phase B runs the autonomous optimization loop: the agent edits kernel.py, the benchmark verifies through five correctness stages, and the orchestrator decides to keep, revert, or move to the next kernel. Phase C verifies end-to-end correctness and speedup.
Figure 2: Five-stage correctness pipeline. Any failure immediately rejects the candidate. Throughput is only measured after all five stages pass. Each stage catches a distinct class of bugs.
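The fail-fast behaviour described in the Figure 2 caption can be sketched as follows. This is an illustrative outline rather than AutoKernel's actual harness: the stage names follow the abstract, while `validate_then_time`, the check callables, and the `measure` callback are hypothetical placeholders.

```python
from typing import Callable, List, Optional, Tuple

# Stage names follow the abstract; the checks themselves are placeholders.
STAGE_NAMES = [
    "smoke",
    "shape_sweep",
    "numerical_stability",
    "determinism",
    "edge_cases",
]


def validate_then_time(
    checks: List[Tuple[str, Callable[[], bool]]],
    measure: Callable[[], float],
) -> Tuple[Optional[str], Optional[float]]:
    """Return (failed_stage, None) on the first failure, else (None, throughput)."""
    for name, check in checks:
        if not check():
            # Reject immediately: later stages and timing are skipped.
            return name, None
    # Throughput is measured only after every stage has passed.
    return None, measure()
```

An orchestrator in the style of Figure 1 could keep the candidate when a throughput value is returned and improves on the current best, and revert the edit otherwise.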
