ConCuR: Conciseness Makes State-of-the-Art Kernel Generation

Lingcheng Kong, Jiateng Wei, Hanzhang Shen, Huan Wang

TL;DR

The paper tackles the data scarcity barrier in open CUDA kernel generation by introducing ConCuR, a two-stage data synthesis and curation pipeline that pairs concise reasoning traces with CUDA kernels. Trained via LoRA on this curated dataset, KernelCoder achieves state-of-the-art performance on KernelBench, surpassing both open-source fine-tuned models and frontier LLMs. A key insight is that concise reasoning traces correlate with higher kernel quality, and the authors also propose using reasoning length as a metric to gauge task difficulty and benchmark rigor. The work provides a practical path to stronger kernel-generation models and lays groundwork for data-driven improvements in kernel design and evaluation.

Abstract

GPU kernel generation by LLMs has recently experienced rapid development, leveraging test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the scarcity of high-quality data, as most high-quality kernels are proprietary and not open-source. This challenge prevents us from leveraging supervised fine-tuning to align LLMs to the kernel generation task. To address this challenge, we develop a pipeline that generates and curates high-quality CUDA kernels with reasoning traces, motivated by a critical observation that concise yet informative reasoning traces result in robust generation of high-performance kernels. Using this pipeline, we construct our dataset ConCuR and introduce our model KernelCoder, which is the first model trained on a curated dataset consisting of PyTorch, reasoning, and CUDA kernel pairs, to our knowledge. In the KernelBench setup, our model achieves significant improvements over the existing top-performing model, QwQ-32B, and outperforms all open-source models fine-tuned for kernel generation, as well as frontier models such as DeepSeek-V3.1-Think and Claude-4-sonnet. Finally, we show that the average reasoning length can serve as a metric to assess the difficulty of kernel generation tasks. The observations, metrics, and our data collection and curation pipeline can help obtain better data in the kernel generation task in the future.
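The two-stage idea in the abstract (synthesize kernels with reasoning traces and test them, then curate for concise traces) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Sample` record, the `curate` function, and both thresholds (`max_reasoning_tokens`, `min_speedup`) are hypothetical names and values chosen for the example, and the actual selection criteria in the paper may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One synthesized (PyTorch task, reasoning trace, CUDA kernel) candidate.

    All field names are hypothetical; they only mirror the quantities the
    abstract and Figure 1 mention (correctness, speedup, reasoning length).
    """
    task_id: str
    reasoning_tokens: int  # length of the reasoning trace in tokens
    correct: bool          # did the kernel pass the unit tests?
    speedup: float         # speedup over the torch eager implementation


def curate(
    samples: List[Sample],
    max_reasoning_tokens: int = 4096,  # assumed threshold, not from the paper
    min_speedup: float = 1.0,          # assumed threshold, not from the paper
) -> List[Sample]:
    """Keep, per task, the correct and fast sample with the shortest reasoning."""
    best: Dict[str, Sample] = {}
    for s in samples:
        # Stage 1 outcome: drop kernels that fail tests or are too slow.
        if not s.correct or s.speedup < min_speedup:
            continue
        # Stage 2 (assumed criterion): prefer concise reasoning traces.
        if s.reasoning_tokens > max_reasoning_tokens:
            continue
        current = best.get(s.task_id)
        if current is None or s.reasoning_tokens < current.reasoning_tokens:
            best[s.task_id] = s
    # Sort by task id so the output order is deterministic.
    return sorted(best.values(), key=lambda s: s.task_id)
```

Picking the shortest passing trace per task is only one way to encode "concise yet informative"; a real pipeline would likely also balance task coverage and kernel performance.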

Paper Structure

This paper contains 25 sections, 3 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overview of our two-stage data gathering pipeline. The first stage synthesizes CUDA kernels with corresponding CoTs and runs unit tests on each generated kernel to verify its correctness and measure its speedup over the torch eager implementation. The second stage selects high-quality reasoning traces based on the criteria described in Section \ref{sec:criteria}.
  • Figure 2: Relationship between reasoning length and accuracy rate. (a): Boxplot of reasoning length distributions for correct and incorrect responses, indicating that incorrect responses generally involve longer reasoning. (b): Accuracy rate across reasoning length bins (blue bars) with corresponding sample counts (red line). The results indicate that shorter reasoning is generally associated with higher accuracy, whereas longer reasoning tends to reduce accuracy.
  • Figure 3: Scatter plot showing the relationship between reasoning length (tokens) and speedup over eager execution. A linear fit yields a Pearson correlation coefficient of $r = -0.047$ ($p<0.01$), indicating that reasoning length has virtually no practical impact on performance.
  • Figure 4: Training statistics of KernelCoder on ConCuR.
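
The Figure 3 caption reports a correlation that is statistically significant yet practically negligible. The sketch below shows how the two can coexist: it computes a Pearson coefficient, the share of variance it explains, and the usual t statistic for testing whether the correlation is zero. The function names are illustrative, and the sample sizes used in the note afterwards are hypothetical, since the number of points in the figure is not given here.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def variance_explained(r: float) -> float:
    """Fraction of variance in one variable explained by a linear fit (r^2)."""
    return r * r


def t_statistic(r: float, n: int) -> float:
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1 - r * r))
```

For $r = -0.047$, `variance_explained` gives about 0.0022, so a linear fit on reasoning length accounts for roughly 0.2% of the variance in speedup. The t statistic grows with the square root of the sample size: at a hypothetical n = 100 it is about -0.47, while at a hypothetical n = 3000 it is about -2.58, which is near the two-sided 1% threshold under a normal approximation. A small p-value here therefore reflects the amount of data rather than a strong effect.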