ConCuR: Conciseness Makes State-of-the-Art Kernel Generation

Lingcheng Kong; Jiateng Wei; Hanzhang Shen; Huan Wang

TL;DR

The paper tackles the data scarcity barrier in open CUDA kernel generation by introducing ConCuR, a two-stage data synthesis and curation pipeline that pairs concise reasoning traces with CUDA kernels. Trained via LoRA on this curated dataset, KernelCoder achieves state-of-the-art performance on KernelBench, surpassing both open-source fine-tuned models and frontier LLMs. A key insight is that concise reasoning traces correlate with higher kernel quality, and the authors also propose using reasoning length as a metric to gauge task difficulty and benchmark rigor. The work provides a practical path to stronger kernel-generation models and lays groundwork for data-driven improvements in kernel design and evaluation.

Abstract

GPU kernel generation by LLMs has advanced rapidly, driven by test-time scaling and reinforcement learning techniques. A key remaining challenge is the scarcity of high-quality data: most high-quality kernels are proprietary and not open-source, which prevents us from using supervised fine-tuning to align LLMs to the kernel generation task. To address this challenge, we develop a pipeline that generates and curates high-quality CUDA kernels with reasoning traces, motivated by the observation that concise yet informative reasoning traces lead to robust generation of high-performance kernels. Using this pipeline, we construct our dataset, ConCuR, and introduce our model, KernelCoder, which is, to our knowledge, the first model trained on a curated dataset of PyTorch programs paired with reasoning traces and CUDA kernels. In the KernelBench setup, our model achieves significant improvements over the existing top-performing model, QwQ-32B, and outperforms all open-source models fine-tuned for kernel generation, as well as frontier models such as DeepSeek-V3.1-Think and Claude-4-sonnet. Finally, we show that average reasoning length can serve as a metric for the difficulty of kernel generation tasks. These observations, the metric, and our data collection and curation pipeline can help obtain better data for kernel generation in the future.
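The two ideas in the abstract, curating by conciseness and using average reasoning length as a difficulty signal, can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Sample` fields, the whitespace token count, and the `max_length` and `min_speedup` thresholds are all assumptions made for illustration, since the abstract does not state the actual selection criteria.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    task_id: str     # hypothetical identifier of the PyTorch task
    reasoning: str   # reasoning trace produced before the kernel
    kernel: str      # generated CUDA source
    correct: bool    # result of a correctness check (assumed to exist)
    speedup: float   # speedup over the PyTorch baseline (assumed to exist)


def reasoning_length(sample: Sample) -> int:
    # Whitespace token count, a stand-in for a real tokenizer.
    return len(sample.reasoning.split())


def curate(samples, max_length, min_speedup=1.0):
    """Keep correct, fast kernels whose reasoning trace is short."""
    return [
        s for s in samples
        if s.correct
        and s.speedup >= min_speedup
        and reasoning_length(s) <= max_length
    ]


def average_reasoning_length(samples):
    """Mean reasoning length per task, read here as a rough difficulty proxy."""
    by_task = {}
    for s in samples:
        by_task.setdefault(s.task_id, []).append(reasoning_length(s))
    return {task: mean(lengths) for task, lengths in by_task.items()}
```

Under these assumptions, a task whose samples have a higher average reasoning length would be treated as harder, and `curate` would keep only the short-trace, correct, faster-than-baseline samples for fine-tuning.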
