CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

Tara Saba, Anne Ouyang, Xujie Si, Fan Long

Abstract

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness and achieve competitive performance across iterative refinements. We present CuTeGen, an agentic framework for automated generation and optimization of GPU kernels that treats kernel development as a structured generate--test--refine workflow. Unlike approaches that rely on one-shot generation or large-scale search over candidate implementations, CuTeGen focuses on progressive refinement of a single evolving kernel through execution-based validation, structured debugging, and staged optimization. A key design choice is to generate kernels using the CuTe abstraction layer, which exposes performance-critical structures such as tiling and data movement while providing a more stable representation for iterative modification. To guide performance improvement, CuTeGen incorporates workload-aware optimization prompts and delayed integration of profiling feedback. Experimental results on matrix multiplication and activation workloads demonstrate that the framework produces functionally correct kernels and achieves competitive performance relative to optimized library implementations.

Paper Structure

This paper contains 36 sections, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: CuTeGen high-level architecture showing iterative kernel generation, correctness assurance, and optimization.
  • Figure 2: Initial synthesis prompt used to generate the first candidate kernel.
  • Figure 3: Detailed view of the correctness assurance component. Generated kernels are compiled and validated against reference outputs, and all errors are recorded. The model generates structured patches that are applied iteratively until correctness is achieved or the generation budget is exhausted. Surrounding stages are omitted for clarity.
  • Figure 4: Detailed view of the optimization component. Profiling metrics are incorporated into the prompt when enabled to guide performance-aware transformations. Surrounding stages are omitted for clarity.
  • Figure 5: Effect of profiling timing on optimization behavior. (a) Profiling feedback provided from the initial optimization iteration (depth 1). (b) Profiling feedback introduced after structural optimization (depth 11). Delaying profiling produces a stronger optimization trajectory and improved final performance.
  • ...and 6 more figures