CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding

TL;DR

CudaForge introduces a training-free, hardware-aware multi-agent workflow for CUDA kernel generation that pairs two specialized LLM agents, a Coder and a Judge, guided by Nsight Compute metrics and GPU specifications. It achieves state-of-the-art results on KernelBench with 97.6% kernel correctness and an average speedup of about 1.68× over PyTorch baselines, while generalizing across GPUs and base models at low cost (about $0.30 and 26.5 minutes per kernel). The framework's hardware-feedback loop provides targeted optimizations, outperforming RL-based and agentic baselines and maintaining robust performance as the number of refinement rounds increases. The work demonstrates that simple, hardware-guided, training-free workflows can deliver cost-effective, scalable CUDA kernel optimization with practical applicability and open-source access.

Abstract

Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code generation. Existing methods for automatic kernel generation often produce low-efficiency kernels, incur high computational overhead, and fail to generalize across settings. In this work, we propose CudaForge, a training-free multi-agent workflow for CUDA kernel generation and optimization. Our workflow is inspired by the iterative process followed by human experts: developing initial kernels, testing correctness, analyzing hardware feedback, and iteratively improving the result. More specifically, CudaForge employs two LLM agents, a Coder and a Judge, that iteratively generate, correct, and optimize CUDA kernels while integrating hardware feedback such as Nsight Compute (NCU) metrics. In extensive evaluations, we show that CudaForge, by leveraging base models like OpenAI-o3, achieves 97.6% correctness of generated kernels and an average 1.68× speedup over PyTorch baselines, substantially surpassing state-of-the-art models including OpenAI-o3 and Kevin on KernelBench. Beyond accuracy and speed, CudaForge demonstrates strong generalization across GPUs (A100, RTX 6000, 4090, 3090) and base models (OpenAI-o3, GPT-5, gpt-oss-120B, Claude-Sonnet-4, QwQ-32B), while maintaining high efficiency. In particular, generating an optimized kernel takes about 26.5 minutes on one RTX 6000 and incurs about $0.3 in API cost, which is significantly cheaper than existing agentic work that requires 6 H100 hours and $5 in API cost per kernel. Our results highlight that multi-agent, training-free workflows can enable cost-effective, generalizable, and high-performance CUDA kernel optimization. Code is available at https://github.com/OptimAI-Lab/CudaForge

Paper Structure

This paper contains 34 sections, 9 figures, 5 tables, 2 algorithms.

Figures (9)

  • Figure 1: CudaForge achieves state-of-the-art results on KernelBench in both correctness and performance, surpassing RL-based methods such as Kevin-32B [kevin-multi-turn-rl], the agentic baseline [lange2025robustagenticcudakernel], and OpenAI-o3 [openai2025o3systemcard]. To further evaluate the effectiveness of our design, we additionally develop three customized variants of OpenAI-o3 (o3-self-refine, o3-correction, and o3-optimization), which serve as baselines for ablation comparison. Scaling up the maximum number of iteration rounds (CudaForge-Scaling Up) further improves CudaForge's speedup to 2.27×. Experimental details are provided in the experiments section.
  • Figure 2: Comparison between human and CudaForge workflows. Top: Human experts iteratively refine kernels by writing a prototype, testing it, and analyzing runtime feedback. Bottom: CudaForge mimics the human workflow with two specialized agents (Coder and Judge). The Coder generates candidate kernels, while the Judge analyzes runtime information and hardware feedback to provide correction or optimization feedback. The process iterates until the maximum round N is reached.
  • Figure 3: The overview of how CudaForge optimizes kernels, compared with Kevin-32B. Top: the pipeline of the RL-based Kevin-32B, which relies solely on textual refinement and thus performs blind exploration. Bottom: our CudaForge workflow, which leverages hardware feedback to guide kernel optimization. When the Coder in CudaForge generates a correct candidate kernel in Round 1, the system profiles it using Nsight Compute (NCU) to obtain NCU metrics. In Round 2, the Judge analyzes these metrics and GPU specifications to identify performance bottlenecks (e.g., register- or memory-limited) and provides targeted optimization feedback. The Coder then refines the kernel accordingly. Compared with Kevin-32B, which only refines based on speedup scores, our framework achieves more interpretable and effective performance improvements through hardware-aware iteration.
  • Figure 4: Comparison of correctness and performance between CudaForge and the Agentic Baseline on KernelBench. Dashed lines denote the average results of CudaForge over Levels 1 and 2. CudaForge outperforms the Agentic Baseline on KernelBench Levels 1 and 2, and it also achieves strong performance on Level 3.
  • Figure 5: Comparison of correctness and performance between CudaForge and Kevin-32B on KernelBench. Dashed lines denote the average results of CudaForge over Levels 1 and 2. Although training-free, CudaForge outperforms Kevin-32B on KernelBench Levels 1 and 2 and achieves outstanding results on Level 3.
  • ...and 4 more figures
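
The Coder and Judge loop described in the captions of Figures 2 and 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `optimize_kernel`, the `Candidate` record, and the callables `coder`, `check`, `profile`, `measure_speedup`, and `judge` are all hypothetical names standing in for the LLM calls, the correctness test, the Nsight Compute profiling step, and the timing harness. Returning the fastest correct kernel seen so far is also an assumption; the captions only state that the process runs until the maximum round N.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    source: str      # kernel source text produced by the Coder
    round_idx: int   # round in which this kernel was produced
    speedup: float   # measured speedup over the reference implementation


def optimize_kernel(
    task: str,
    coder: Callable[[str, Optional[str], Optional[str]], str],
    check: Callable[[str], Optional[str]],
    profile: Callable[[str], dict],
    measure_speedup: Callable[[str], float],
    judge: Callable[[str, Optional[str], Optional[dict], dict], str],
    gpu_specs: dict,
    max_rounds: int,
) -> Optional[Candidate]:
    """Run the Coder/Judge loop for max_rounds rounds.

    Returns the fastest correct kernel seen (an assumption of this sketch),
    or None if no round produced a correct kernel.
    """
    best: Optional[Candidate] = None
    kernel: Optional[str] = None
    feedback: Optional[str] = None

    for round_idx in range(1, max_rounds + 1):
        # The Coder writes or revises a kernel using the latest Judge feedback.
        kernel = coder(task, kernel, feedback)

        # check() returns None when the kernel is correct, else an error log.
        error = check(kernel)
        if error is not None:
            # Correction feedback: no profiling of a failing kernel.
            feedback = judge(kernel, error, None, gpu_specs)
            continue

        speedup = measure_speedup(kernel)
        if best is None or speedup > best.speedup:
            best = Candidate(kernel, round_idx, speedup)

        # Optimization feedback: the Judge sees profiler metrics and GPU specs.
        metrics = profile(kernel)
        feedback = judge(kernel, None, metrics, gpu_specs)

    return best
```

Passing the agents and tools in as callables keeps the loop itself independent of any particular LLM API or profiler invocation, so the control flow can be exercised with simple stand-in functions before real components are connected.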