DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

Haolei Bai, Lingcheng Kong, Xueyi Chen, Jianmian Wang, Zhiqiang Tao, Huan Wang

TL;DR

This work targets CUDA kernel generation with diffusion large language models by addressing data scarcity through the CuKe dataset and introducing BiC-RL, a two-stage reinforcement learning framework that first teaches kernel infilling and then end-to-end generation. The resulting DICE models (1.7B, 4B, 8B) achieve state-of-the-art performance on KernelBench, often matching or surpassing larger autoregressive and diffusion peers while demonstrating strong robustness and reduced deceptive behavior. By grounding generation in a structured kernel scaffold and progressive training, the approach yields functionally correct, high-speed CUDA kernels and offers a practical path toward scalable HPC code optimization. The work thus advances diffusion-based code generation for specialized hardware and provides a data-efficient, reproducible pipeline for high-performance kernel development.

Abstract

Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well suited to code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging: the task is highly specialized, and high-quality training data is severely lacking. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. On top of it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation at three parameter scales: 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state of the art for CUDA kernel generation.

Paper Structure (21 sections, 3 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview of DICE. The framework improves the robustness of CUDA kernel generation in dLLMs by building on TraceRL. This hierarchical approach integrates: (1) the Bi-phase Curated Reinforcement Learning (BiC-RL) framework, a progressive RL training strategy that consists of a kernel infilling stage and an end-to-end kernel generation stage to ensure functional correctness and high performance of the generated CUDA kernels, and (2) data scheduling, which moves the training data from basic single operations to complex whole-model structures across the two RL stages. A valid reward is returned only when the generated CUDA kernel compiles and functions correctly.
  • Figure 2: The inference paradigm of diffusion large language models. Left: The sequence is divided into several blocks (the block length is four in this figure). The block diffusion mechanism lets the model generate autoregressively across blocks while decoding tokens in parallel within each block. The KV cache from all previous blocks is reused. Right: An actual step-by-step generation trajectory for an example CUDA kernel. Although the overall trend remains autoregressive, substantial non-autoregressive behavior is clearly visible during generation.
  • Figure 3: Our defined CUDA kernel components: the prefix, the suffix, and the core implementation, which is a C++ snippet.
  • Figure 4: Comparison of RL training trajectories for the BiC-RL framework and baseline RL on the 8B model.
  • Figure 5: Comparison of correctness trends for BiC-RL and baseline RL with the 8B model on KernelBench Level 2.
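
The captions for Figures 1 and 3 describe two mechanisms that a small example may make more concrete: a kernel split into a prefix, a core implementation, and a suffix for the infilling stage, and a reward that is valid only when the kernel compiles and functions correctly. The Python sketch below is illustrative only and is not the authors' implementation. The `<mask>` token, the `// BEGIN_CORE` and `// END_CORE` markers, the helper names, and the use of speedup as the reward value are all assumptions made for this example.

```python
from dataclasses import dataclass

MASK = "<mask>"  # hypothetical mask token; the real tokenizer's symbol may differ


@dataclass
class KernelParts:
    prefix: str  # everything before the core implementation
    core: str    # the C++ snippet the model has to produce
    suffix: str  # everything after the core implementation


def split_kernel(source: str, start_tag: str, end_tag: str) -> KernelParts:
    """Split a kernel source string around a marked core region.

    The marker tags are an assumption of this sketch, not part of the paper.
    """
    head, rest = source.split(start_tag, 1)
    core, tail = rest.split(end_tag, 1)
    return KernelParts(prefix=head, core=core, suffix=tail)


def infilling_prompt(parts: KernelParts, n_mask: int) -> str:
    """Keep the prefix and suffix visible and replace the core with mask tokens."""
    return parts.prefix + MASK * n_mask + parts.suffix


def gated_reward(compiled: bool, correct: bool, speedup: float) -> float:
    """Return a non-zero reward only for kernels that compile and are correct.

    Returning 0.0 for invalid kernels and the raw speedup otherwise is an
    assumed shaping; the paper's actual reward values may differ.
    """
    if not (compiled and correct):
        return 0.0
    return max(speedup, 0.0)


# Toy vector-add kernel with the core region marked by hypothetical comments.
source = (
    "__global__ void add(const float* a, const float* b, float* c, int n) {\n"
    "// BEGIN_CORE\n"
    "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
    "    if (i < n) c[i] = a[i] + b[i];\n"
    "// END_CORE\n"
    "}\n"
)

parts = split_kernel(source, "// BEGIN_CORE\n", "// END_CORE\n")
prompt = infilling_prompt(parts, n_mask=4)
```

In this sketch, the infilling stage would show the model `prompt` and ask it to recover `parts.core`, while the end-to-end stage would drop the scaffold and request the whole kernel. In both cases a candidate that fails to compile or produces wrong outputs receives no reward, whatever its measured speed.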