PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

Kelun Lei, Hailong Yang, Huaitao Zhang, Xin You, Kaige Zhang, Zhongzhi Luan, Yi Liu, Depei Qian

TL;DR

PRAGMA addresses the coarse feedback available to LLM-driven kernel generation by introducing a profile-guided multi-agent framework that reasons over hardware bottlenecks. A Profiler Agent collects hardware profiles (Nsight Compute, Linux Perf) and a Conductor Agent translates that profiling data into targeted optimization suggestions, while the framework preserves the historical best kernels across iterations. On KernelBench, PRAGMA outperforms Torch and prior LLM baselines on both CPU and GPU backends, with average speedups over Torch of $2.81\times$ on CPU and $2.30\times$ on GPU, and up to $10.95\times$ over the profiling-free baseline in some cases. The work demonstrates a practical path toward autonomous, hardware-aware kernel optimization that generalizes across backends via a Triton-backed multi-backend approach.

Abstract

Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel generation, yet most existing systems rely solely on correctness or execution-time feedback and cannot reason about low-level performance bottlenecks. In this paper, we introduce PRAGMA, a profile-guided AI kernel generation framework that integrates execution feedback and fine-grained hardware profiling into the reasoning loop. PRAGMA enables LLMs to identify performance bottlenecks, preserve historical best versions, and iteratively refine code quality. We evaluate PRAGMA on KernelBench, covering GPU and CPU backends. Results show that PRAGMA consistently outperforms the baseline AIKG without profiling enabled and achieves average speedups of 2.81$\times$ and 2.30$\times$ over Torch on CPU and GPU platforms, respectively.

Paper Structure

This paper contains 15 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of PRAGMA. The code generation and optimization process shows the interaction between the Coder, Verifier, Profiler, and Conductor agents. The system begins with a task description given as input to the Coder.
  • Figure 2: Profile-guided iterative optimization process. The Coder agent refines the code iteratively using suggestions from the Conductor agent, which receives feedback from the Verifier and Profiler.
  • Figure 3: Performance of kernels from six KernelBench categories generated by PRAGMA and N-PRAGMA on CPU and GPU, with the reported speedup normalized to the Torch baseline.
  • Figure 4: Performance changes of PRAGMA and N-PRAGMA on the KernelBench "Max reduction over a dimension" task over five consecutive attempts.
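
The loop described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `optimize`, the `Candidate` record, the four agent callables, and the metric name `dram_throughput_pct` are all hypothetical, and the toy agents return canned values instead of calling an LLM, Nsight Compute, or Linux Perf. The sketch shows only the control flow: generate, verify, profile, turn feedback into a suggestion, and keep the best correct kernel seen so far, with speedup reported relative to a baseline runtime as in Figure 3.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str        # generated kernel source (here just a label)
    runtime_ms: float  # measured runtime; inf if the kernel is incorrect
    correct: bool      # did the kernel pass verification?


def optimize(task, coder, verifier, profiler, conductor, baseline_ms, max_iters=5):
    """Hypothetical profile-guided refinement loop (assumed interfaces)."""
    best = None        # historical best correct kernel
    suggestion = None  # Conductor feedback passed to the next Coder call
    history = []
    for _ in range(max_iters):
        source = coder(task, suggestion, best.source if best else None)
        correct, runtime_ms = verifier(source)
        cand = Candidate(source, runtime_ms, correct)
        history.append(cand)
        # Preserve the best version: a slower or broken attempt never replaces it.
        if correct and (best is None or runtime_ms < best.runtime_ms):
            best = cand
        # Only profile kernels that run correctly.
        profile = profiler(source) if correct else {}
        suggestion = conductor(correct, runtime_ms, profile)
    # Speedup normalized to the baseline (e.g. a Torch reference runtime).
    speedup = baseline_ms / best.runtime_ms if best else None
    return best, speedup, history


def make_toy_agents():
    """Canned stand-ins for the four agents; all values are made up."""
    outcomes = [(True, 8.0), (True, 4.0), (False, float("inf")), (True, 5.0), (True, 6.0)]
    calls = {"n": 0}

    def coder(task, suggestion, best_source):
        calls["n"] += 1
        return f"{task}_v{calls['n']}"

    def verifier(source):
        version = int(source.rsplit("_v", 1)[1])
        return outcomes[version - 1]

    def profiler(source):
        # Hypothetical metric name, not a real profiler counter.
        return {"dram_throughput_pct": 90.0}

    def conductor(correct, runtime_ms, profile):
        if not correct:
            return "fix the correctness error first"
        if profile.get("dram_throughput_pct", 0.0) > 80.0:
            return "memory-bound: reduce global memory traffic"
        return "compute-bound: improve instruction throughput"

    return coder, verifier, profiler, conductor


best, speedup, history = optimize("max_reduction", *make_toy_agents(), baseline_ms=10.0)
print(best.source, speedup)
```

In this toy run the third attempt fails verification and the fourth and fifth are slower than the second, so the second attempt remains the retained best. That is the behavior the "preserve historical best versions" design is meant to guarantee.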