cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution

Jinwu Chen, Qidie Wu, Bin Li, Lin Ma, Xin Si, Yang Hu, Shouyi Yin, Jun Yang

TL;DR

CUDA kernel optimization is a labor-intensive, hardware-specific challenge poorly addressed by prior LLM+evolution methods due to mismatches in representation and evaluation. cuPilot introduces a strategy-coordinated multi-agent framework that treats strategy as an intermediate representation, featuring a Strategy-Coordinated Evolution (SCE) algorithm, roofline-guided prompting, and strategy-level population initialization. Empirically, cuPilot delivers an average 3.09× speedup over PyTorch across 100 KernelBench kernels and up to 4.06× on GEMM workloads, with ablations confirming the value of roofline guidance and history-informed initialization. The generated kernels are open-sourced.

Abstract

Optimizing CUDA kernels is a challenging and labor-intensive task, given the need for hardware-software co-design expertise and the proprietary nature of high-performance kernel libraries. While recent large language models (LLMs) combined with evolutionary algorithms show promise in automatic kernel optimization, existing approaches often fall short in performance due to their suboptimal agent designs and mismatched evolution representations. This work identifies these mismatches and proposes cuPilot, a strategy-coordinated multi-agent framework that introduces strategy as an intermediate semantic representation for kernel evolution. Key contributions include a strategy-coordinated evolution algorithm, roofline-guided prompting, and strategy-level population initialization. Experimental results show that kernels generated by cuPilot achieve an average speedup of 3.09$\times$ over PyTorch on a benchmark of 100 kernels. On the GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units. The generated kernels are open-sourced at https://github.com/champloo2878/cuPilot-Kernels.git.
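
The roofline-guided prompting mentioned above builds on the standard roofline model, in which a kernel's attainable throughput is bounded by min(peak compute, memory bandwidth × arithmetic intensity). The sketch below is only an illustration of that classification step, not cuPilot's implementation: the `Gpu` class, the device numbers, and the hint strings are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical device limits; the values used below are illustrative only."""
    peak_flops: float      # peak compute throughput, FLOP/s
    peak_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOP/byte) at which the two roofs meet.
        return self.peak_flops / self.peak_bandwidth


def attainable_flops(gpu: Gpu, intensity: float) -> float:
    """Roofline bound: min(compute roof, bandwidth roof * intensity)."""
    return min(gpu.peak_flops, gpu.peak_bandwidth * intensity)


def classify(gpu: Gpu, flops: float, bytes_moved: float) -> str:
    """Label a kernel by which roof limits it at its arithmetic intensity."""
    intensity = flops / bytes_moved
    return "memory-bound" if intensity < gpu.ridge_point else "compute-bound"


def prompt_hint(bound: str) -> str:
    """Return an optimization hint for the given bound (assumed wording)."""
    hints = {
        "memory-bound": "Reduce global memory traffic: coalesce accesses, "
                        "reuse data in shared memory, fuse adjacent ops.",
        "compute-bound": "Raise arithmetic throughput: use tensor cores, "
                         "increase instruction-level parallelism.",
    }
    return hints[bound]


# Assumed device: 100 TFLOP/s compute, 2 TB/s bandwidth (not a real GPU spec).
gpu = Gpu(peak_flops=100e12, peak_bandwidth=2e12)

# float32 vector add of n elements: n FLOPs, 12n bytes (two loads, one store).
n = 1_000_000
vec_add = classify(gpu, flops=n, bytes_moved=12 * n)

# float32 square GEMM of size N: 2*N^3 FLOPs, at least 12*N^2 bytes moved.
N = 4096
gemm = classify(gpu, flops=2 * N**3, bytes_moved=12 * N**2)

print(vec_add, "->", prompt_hint(vec_add))
print(gemm, "->", prompt_hint(gemm))
```

Under these assumed limits the vector add falls well below the ridge point and the GEMM well above it, so the two kernels would receive different hints. Real byte counts depend on caching and tiling, so measured profiler data would be a better input than these analytic estimates.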

Paper Structure

This paper contains 16 sections, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: The impact of parent kernels' complexity on generated child kernels during conventional crossover prompting.
  • Figure 2: Overview of cuPilot multi-agent framework. Three key contributions are illustrated: SCE algorithm, roofline-guided prompting, and strategy-level population initialization.
  • Figure 3: (a) Kernels positioned in GPU's roofline model. (b) Prompting examples for compute/memory-bound kernels.
  • Figure 4: Historical data pair format for StrategyApplication prompt and database construction for RAG.
  • Figure 5: Performance comparison of cuPilot, PyTorch, and AI CUDA Engineer on the KernelBench benchmark.
  • ...and 3 more figures