Kevin: Multi-Turn RL for Generating CUDA Kernels

Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, Silas Alberti

TL;DR

The paper presents Kevin, the first model trained with a flexible multi-turn RL framework for generating and optimizing CUDA kernels. By incorporating long-horizon trajectories, per-turn rewards, and summarized context, Kevin achieves substantial gains in kernel correctness and speedup over a strong base model and frontier models, while enabling meaningful test-time scaling through sequential refinement. The study analyzes reward aggregation strategies, instability proxies, and reward hacking, and demonstrates that multi-turn training yields faster improvement with more turns and better scaling under increasing test-time compute. Limitations include compute demands and reliance on a strong base model, with future work exploring richer value networks, PPO, and search-based test-time verification to broaden applicability beyond kernels.

Abstract

Writing GPU kernels is a challenging task and critical for AI systems' efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this process into training, we develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings, such as learning from long trajectories and effective reward attribution across turns. We present Kevin - K(ernel D)evin, the first model trained with multi-turn RL for CUDA kernel generation and optimization. In our evaluation setup, Kevin shows significant gains over its base model (QwQ-32B), improving correctness of generated kernels (in pure CUDA) from 56% to 82% and mean speedup from 0.53x to 1.10x of baseline (PyTorch Eager), and surpassing frontier models like o4-mini (0.78x). Finally, we study its behavior across test-time scaling axes: we found scaling serial refinement more beneficial than parallel sampling. In particular, when given more refinement turns, Kevin shows a higher rate of improvement.

Paper Structure

This paper contains 40 sections, 3 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Within each training step, the model iteratively generates, executes, and refines kernels over multiple turns. Kernels are rewarded individually, based both on their performance and their contribution to subsequent speedups: K1, for example, while incorrect, leads to both a correct, slow kernel, K2, and a correct, performant kernel, K3, and should thus be rewarded accordingly. This setup enables Kevin to learn advanced code generation strategies that span multiple turns. Note: CoT' is the summarized chain of thought (CoT).
  • Figure 2: Reward plateaus during single-turn training. We trained up to step 50 (100 gradient steps).
  • Figure 3: Sum with $\gamma=0.4$ is the most effective reward formulation. Here we evaluate models trained with different reward formulations (Sum vs Max aggregation across turns and discount factor $\gamma = 0.4$ vs $\gamma = 0.8$) with 16 parallel trajectories and 8 refinement turns. We compare how each setup scales with refinement turns. Though Sum with $\gamma = 0.4$ achieves lower performance and correctness in the first turn, it exhibits the best scaling behavior overall.
  • Figure 4: Reward climbs steadily for multi-turn training. We train up to 40 steps (80 gradient steps).
  • Figure 5: Kevin effectively leverages multiple turns. We evaluate the above checkpoints under the same environment with 16 parallel trajectories and 8 refinement turns. We observe that both Kevin and the single-turn RL model significantly improve upon QwQ-32B, but the performance curve for Kevin is steeper than that of the single-turn model.
  • ...and 8 more figures
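
The captions for Figures 1 and 3 describe rewarding each kernel for its own performance and for its contribution to later turns, aggregated with either Sum or Max under a discount factor $\gamma$. The exact formula is not given in this excerpt, so the following is only a minimal sketch of one plausible reading: the reward for turn t combines the scores of turn t and all later turns, each discounted by $\gamma$ raised to its distance from t. The function name `aggregate_rewards`, the `mode` argument, and the example scores are hypothetical and not taken from the paper.

```python
from typing import List


def aggregate_rewards(scores: List[float], gamma: float = 0.4, mode: str = "sum") -> List[float]:
    """Return one reward per turn (illustrative assumption, not the paper's code).

    scores[k] is the standalone score of the kernel produced at turn k.
    The reward for turn t aggregates gamma**(k - t) * scores[k] over k >= t,
    using either a sum or a max.
    """
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    n = len(scores)
    rewards = []
    for t in range(n):
        # Discount each current or later score by its distance from turn t.
        discounted = [gamma ** (k - t) * scores[k] for k in range(t, n)]
        rewards.append(sum(discounted) if mode == "sum" else max(discounted))
    return rewards


# Hypothetical trajectory in the spirit of Figure 1: K1 is incorrect (score 0),
# K2 is correct but slow, K3 is correct and fast.
example_scores = [0.0, 0.5, 2.0]
sum_rewards = aggregate_rewards(example_scores, gamma=0.5, mode="sum")
max_rewards = aggregate_rewards(example_scores, gamma=0.5, mode="max")
```

Under this reading, the incorrect first kernel still receives a positive reward because it leads to better kernels later, which matches the intuition in the Figure 1 caption. A smaller $\gamma$ shrinks the credit passed back from later turns.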