CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum

TL;DR

CUDA-L1 presents a three-stage, contrastive reinforcement-learning framework that automates CUDA kernel optimization. By combining supervised fine-tuning with data augmentation, self-supervised learning, and contrastive RL, it discovers and composes high-impact CUDA techniques, achieving substantial speedups across 250 KernelBench kernels and multiple GPU architectures. Key contributions include a robust prompt-and-exemplar strategy, a speed-based reward design with defenses against reward hacking, and documented generalization to H100, L40, RTX 3090, and H20. The results demonstrate that RL can transform a weak foundation model into an effective CUDA optimizer, with large practical implications for reducing manual engineering effort and improving GPU efficiency.

Abstract

The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on A100, it delivers an average speedup of 3.12× with a median speedup of 1.42× against default baselines across all 250 CUDA kernels of KernelBench, with peak speedups reaching 120×. In addition to the default baseline provided by KernelBench, CUDA-L1 achieves 2.77× over Torch Compile, 2.88× over Torch Compile with reduce-overhead mode, 2.81× over CUDA Graph implementations, and, remarkably, 7.72× over cuDNN libraries. Furthermore, the model demonstrates portability across different GPU architectures. Beyond these benchmark results, CUDA-L1 demonstrates several properties: it 1) discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) uncovers fundamental principles of CUDA optimization, such as the multiplicative nature of optimizations; 3) identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that actually harm performance. These capabilities demonstrate that RL can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially improve GPU efficiency and alleviate the rising pressure on GPU computing resources.
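The gap between the reported average and median speedups suggests a heavy-tailed distribution, where a few kernels with very large gains pull the mean far above the typical case. The sketch below illustrates that effect, along with one idealized reading of "multiplicative" optimizations. The speedup values and the `combined_speedup` helper are hypothetical and made up for illustration; they are not results or code from the paper.

```python
import math
import statistics

# Hypothetical per-kernel speedups (made-up values, NOT taken from the paper).
# One large outlier stands in for a kernel with a very high peak speedup.
speedups = [0.9, 1.0, 1.1, 1.2, 1.4, 1.5, 2.0, 3.0, 40.0]

mean_speedup = statistics.mean(speedups)      # pulled upward by the outlier
median_speedup = statistics.median(speedups)  # reflects the typical kernel


def combined_speedup(factors):
    """Idealized composition: assumes each optimization scales runtime
    independently, so the individual speedup factors multiply."""
    return math.prod(factors)


# Two hypothetical optimizations worth 2.0x and 1.5x on their own.
stacked = combined_speedup([2.0, 1.5])

print(f"mean={mean_speedup:.2f} median={median_speedup:.2f} stacked={stacked:.1f}")
```

With these illustrative numbers the mean is about 5.79 while the median is 1.4, so a single outlier accounts for most of the difference. Real optimizations rarely compose this cleanly, so the independence assumption in `combined_speedup` should be read as an upper-bound intuition rather than a model of the paper's findings.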

Paper Structure

This paper contains 48 sections, 5 equations, 4 figures, 16 tables.

Figures (4)

  • Figure 1: Average speedup across different optimization configurations on 5 types of GPU architectures.
  • Figure 2: Overview of the CUDA-L1 training pipeline. The approach consists of three progressive stages: (1) Stage 1: Supervised Fine-tuning with Data Augmentation -- We augment the training dataset with CUDA code variants generated by LLMs and fine-tune the base model on executable and correct implementations to establish foundational CUDA knowledge. (2) Stage 2: Self-supervised Learning -- The model iteratively generates CUDA kernels, validates their correctness and executability, and trains on successfully validated examples, enabling autonomous improvement without human supervision. (3) Stage 3: Contrastive Reinforcement Learning -- We employ contrastive learning with execution-time rewards, training the model to distinguish between faster and slower CUDA implementations, ultimately optimizing for superior performance.
  • Figure : A case from KernelBench (Level 1, Task 12), computing diag(A) * B. We present the reference code and the CUDA-L1 implementation. The CUDA-L1 implementation reduces complexity from $O(N^2M)$ to $O(NM)$, achieving a $64\times$ speedup by replacing the full matrix multiplication with element-wise operations.
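The diag(A) * B case in the figure list above rests on a simple identity: multiplying by a diagonal matrix only scales each row of B, so the full N x N matrix never needs to be built. The following is a minimal pure-Python sketch of that idea, not the code shown in the paper's figure. The function names are hypothetical, and plain lists stand in for GPU tensors. In PyTorch terms the contrast would be roughly `torch.diag(A) @ B` versus `A.unsqueeze(1) * B`, which is an assumption about the form of the reference code.

```python
import random


def diag_matmul_reference(a, b):
    """Materialize the N x N diagonal matrix, then do a full matmul.
    Cost: O(N^2 * M) multiply-adds for an N-vector a and N x M matrix b."""
    n, m = len(a), len(b[0])
    diag = [[a[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return [
        [sum(diag[i][k] * b[k][j] for k in range(n)) for j in range(m)]
        for i in range(n)
    ]


def diag_matmul_scaled(a, b):
    """Scale row i of b by a[i]; the diagonal matrix is never built.
    Cost: O(N * M) multiplications."""
    return [[a_i * x for x in row] for a_i, row in zip(a, b)]


random.seed(0)
N, M = 8, 5
a = [random.uniform(-1.0, 1.0) for _ in range(N)]
b = [[random.uniform(-1.0, 1.0) for _ in range(M)] for _ in range(N)]

ref = diag_matmul_reference(a, b)
fast = diag_matmul_scaled(a, b)

# Largest element-wise difference between the two results.
max_err = max(abs(r - f) for rr, fr in zip(ref, fast) for r, f in zip(rr, fr))
print(f"max abs difference: {max_err:.2e}")
```

Both functions produce the same N x M output, but the second skips the N - 1 multiplications by zero per output element, which is where the drop from $O(N^2M)$ to $O(NM)$ comes from. The measured speedup on a GPU depends on memory bandwidth and launch overhead, which this sketch does not model.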