Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, Junxian He

TL;DR

This work systematically studies reinforcement learning for Triton kernel generation and introduces KernelGym, a robust, distributed environment with hacking checks and profiling feedback that supports long-horizon RL. It addresses reward hacking and lazy optimization through TRLOO, mismatch rejection sampling (MRS), and bottleneck-aware profiling rewards, and adds sequential test-time scaling (STTS) to improve results at inference time. The Dr. Kernel family (notably Dr. Kernel-14B) achieves competitive speedups on KernelBench and surpasses frontier models on some subsets under STTS. The environment, training code, models, and dataset are released to support future research on automated kernel generation.

Abstract

High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization: models may hack training rewards and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGym, a robust distributed GPU environment that supports reward hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGym, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO), which provides unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS). The trained model, Dr. Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr. Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including the environment, training code, models, and dataset, are available at https://www.github.com/hkust-nlp/KernelGYM.
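To make the self-inclusion issue concrete, the following is a minimal sketch rather than the paper's implementation. It contrasts a group-mean baseline, which includes each sample's own reward, with a leave-one-out baseline, and then applies the leave-one-out baseline separately at each turn. The function names, the omission of GRPO's standard-deviation normalization, and the layout of `rewards_by_turn` (one list of group rewards per turn) are all assumptions made for illustration; how TRLOO defines per-turn returns is not specified in this summary.

```python
from typing import List


def group_mean_advantages(rewards: List[float]) -> List[float]:
    """Advantage with a self-inclusive baseline: r_i minus the mean over ALL samples.

    Assumption: GRPO's division by the group standard deviation is omitted
    so the effect of the baseline alone is visible.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Advantage with a leave-one-out baseline: r_i minus the mean of the OTHER samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples in the group")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def turn_level_loo_advantages(rewards_by_turn: List[List[float]]) -> List[List[float]]:
    """Hypothetical turn-level variant: apply leave-one-out within each turn's group.

    rewards_by_turn[t] holds the rewards of all group samples that reached turn t;
    groups may shrink at later turns if some rollouts stop early.
    """
    return [leave_one_out_advantages(turn) for turn in rewards_by_turn]


if __name__ == "__main__":
    group = [1.0, 0.0, 0.0, 0.0]
    print(group_mean_advantages(group))
    print(leave_one_out_advantages(group))
    print(turn_level_loo_advantages([[1.0, 0.0], [2.0, 0.0, 1.0]]))
```

For a group of size N, the self-inclusive advantage equals (N-1)/N times the leave-one-out advantage, because each sample's reward leaks into its own baseline. The shrinkage is largest for small groups, which is one reason a turn-level estimator may prefer the leave-one-out form when few rollouts survive to later turns.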

Paper Structure (51 sections, 14 equations, 14 figures, 2 tables)

Figures (14)

  • Figure 1: Rate of generated kernels achieving at least a $1.2\times$ speedup over the Torch reference on KernelBench across three level subsets. Dr. Kernel-14B is competitive with Claude-4.5-Sonnet and GPT-5, and applying sequential test-time scaling further improves Dr. Kernel-14B, surpassing both models on two of the three subsets.
  • Figure 2: Left: The plot uses a dual y-axis to compare two metrics across two models: Fast@1 of the model trained without the reward hacking check (§\ref{sec:kernelgym:toolkits}) (w/o hacking check), and Fast@1 / Fast@1.2 of the model trained with the hacking check enabled. All variants are evaluated under the same standard, with the hacking check enabled. Multi-turn RL is run on Qwen3-8B-Base after cold-start SFT, using TRLOO for advantage estimation (§\ref{sec:multi-turn-rl}) and KernelGym as the execution environment (§\ref{sec:kernelgym}). Right: Representative cases illustrating reward hacking and lazy optimization. In Hacked_Kernel.py, the model emits a Triton kernel to satisfy the "@triton.jit" heuristic but never calls it, and additionally skips the real computation under the default training mode, inflating the measured speedup. In Lazy_Optimization.py, the model replaces only a trivial sub-operation (channel summation) with a kernel while leaving the remaining computation in Torch, missing the larger gains from fusion.
  • Figure 3: Overview of KernelGym and our training framework. Left: We study RL training methods for kernel generation, including multi-turn RL with TRLOO, profiling-based rewards (PR), and profiling-based rejection sampling (PRS). Right: The architecture of KernelGym, a distributed design with a server-worker split. The server side (interface + task manager) receives evaluation jobs and schedules them on registered distributed GPU workers; each job runs in an isolated subprocess; toolkits produce structured signals for training, parallel evaluation, and data collection.
  • Figure 4: Fast@1 on KernelBench Level 2. Left: Fast@1 at turn 3 over training steps. Right: Fast@1 across turns (evaluated at the selected checkpoint). Since all methods besides AutoTriton achieve their best performance at turn 3, we select checkpoints based on turn 3 performance. For AutoTriton we use their released model.
  • Figure 5: Left: Fast@1.2 at turn 3 over training steps. While MRS stabilizes training, profiling-based methods (PR and PRS) are required to significantly improve the stricter Fast@1.2 metric. Right: Entropy over training steps. While MRS improves training stability, PR and PRS further enhance stability on top of MRS. Additional training dynamics are shown in Figure \ref{fig:training-dynamics}.
  • ...and 9 more figures
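
The captions above report Fast@1 and Fast@1.2, and the abstract also reports a rate obtained by selecting the best candidate across all turns. As a rough sketch of how such rates could be computed, the code below assumes that Fast@p counts a kernel only if it is correct and its speedup over the Torch reference is at least p. The `KernelResult` type, the function names, and the fallback to the last turn when no turn is correct are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KernelResult:
    correct: bool   # passed the correctness (and hacking) checks
    speedup: float  # reference time divided by kernel time


def fast_at_p(results: List[KernelResult], p: float) -> float:
    """Fraction of results that are correct AND reach a speedup of at least p."""
    if not results:
        raise ValueError("results must be non-empty")
    hits = sum(1 for r in results if r.correct and r.speedup >= p)
    return hits / len(results)


def best_across_turns(turns: List[KernelResult]) -> KernelResult:
    """Pick the fastest correct candidate; fall back to the last turn if none is correct."""
    correct = [r for r in turns if r.correct]
    return max(correct, key=lambda r: r.speedup) if correct else turns[-1]


if __name__ == "__main__":
    # Each inner list is one task's candidates, ordered by turn (made-up numbers).
    tasks = [
        [KernelResult(True, 1.3), KernelResult(True, 1.1)],
        [KernelResult(False, 5.0), KernelResult(True, 1.25)],
        [KernelResult(True, 0.9), KernelResult(False, 3.0)],
    ]
    last_turn = [turns[-1] for turns in tasks]
    best = [best_across_turns(turns) for turns in tasks]
    print(fast_at_p(last_turn, 1.2), fast_at_p(best, 1.2))
```

Gating on correctness matters here: an incorrect kernel with a large measured speedup, as in the reward hacking case described for Figure 2, contributes nothing to the rate.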