QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, Yanjun Wu, Chen Zhao, Ling Li

TL;DR

MTMC addresses the challenge of automating high-performance GPU kernel generation by decoupling optimization strategy from low-level implementation. It introduces Macro Thinking, a reinforcement-learning-guided policy that proposes semantic optimizations, and Micro Coding, which incrementally implements these optimizations with general-purpose LLMs. Evaluations on KernelBench and TritonBench show MTMC achieving near-perfect correctness at easier levels and substantial speedups over both general-purpose LLMs and expert-optimized kernels, with robust cross-hardware generalization. This hierarchical framework suggests a scalable path toward portable, automatic kernel generation across diverse GPU architectures.

Abstract

Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise for automation, both general-purpose and finetuned LLMs struggle to satisfy two fundamental and conflicting requirements: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate entire optimized low-level programs, which requires exploring an extremely vast space encompassing both optimization policies and implementation code. To address the challenge of exploring this intractable space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and intricate implementation details, enabling LLMs to generate high-performance GPU kernels. Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves nearly 100% accuracy at Levels 1-2 and 70% at Level 3, over 50% higher than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLMs and 2.2x over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and 34x speedup.
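The division of labor described in the abstract can be pictured as a simple propose-implement-verify loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: every name in it (`mtmc_loop`, `toy_policy`, `toy_coder`, `toy_check`, `toy_latency`) is hypothetical, the policy is a fixed list rather than an RL-trained LLM, and "kernels" are plain strings with a made-up latency model. It shows only the control flow: a macro step proposes one semantic optimization, a micro step applies it incrementally, and a candidate is kept only if it stays correct and runs faster.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    code: str        # current kernel source (a plain string in this toy)
    latency: float   # measured running time, lower is better


def mtmc_loop(
    code: str,
    propose_action: Callable[[str, List[str]], Optional[str]],  # stands in for Macro Thinking
    apply_action: Callable[[str, str], str],                    # stands in for Micro Coding
    is_correct: Callable[[str], bool],
    measure_latency: Callable[[str], float],
    max_steps: int = 4,
) -> Tuple[Candidate, List[str]]:
    """Hypothetical propose/implement/verify loop (illustration only)."""
    best = Candidate(code, measure_latency(code))
    history: List[str] = []
    for _ in range(max_steps):
        action = propose_action(best.code, history)
        if action is None:          # policy has nothing left to suggest
            break
        history.append(action)
        new_code = apply_action(best.code, action)   # one incremental edit
        if not is_correct(new_code):
            continue                # reject incorrect candidates, keep the last good one
        latency = measure_latency(new_code)
        if latency < best.latency:
            best = Candidate(new_code, latency)
    return best, history


# --- Toy stand-ins (assumptions, not real models or benchmarks) ---
ACTIONS = ["tiling", "broken_rewrite", "vectorize"]


def toy_policy(code: str, history: List[str]) -> Optional[str]:
    remaining = [a for a in ACTIONS if a not in history]
    return remaining[0] if remaining else None


def toy_coder(code: str, action: str) -> str:
    return code + f"\n# applied: {action}"


def toy_check(code: str) -> bool:
    return "broken_rewrite" not in code   # pretend this rewrite breaks the kernel


def toy_latency(code: str) -> float:
    return 8.0 / (1 + code.count("# applied:"))   # made-up cost model


best, history = mtmc_loop("y = x @ w", toy_policy, toy_coder, toy_check, toy_latency)
print(history)
print(best.code)
```

In this toy run the incorrect rewrite is proposed but discarded, so the final candidate contains only the two edits that passed the correctness check. A real system would replace the stand-ins with an LLM policy, an LLM code editor, a numerical equivalence test against the PyTorch reference, and on-device timing.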

Paper Structure

This paper contains 16 sections, 4 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Comparison of GPU kernel generation paradigms.
  • Figure 2: MTMC overview. The framework takes unoptimized PyTorch code as input and generates high-performance GPU kernels through a hierarchical process: Macro Thinking generates semantic optimization actions, while Micro Coding implements them step by step. The optimization policy, based on lightweight LLMs, is trained with RL on a compact human-crafted dataset.