GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization

Martin Andrews, Sam Witteveen

TL;DR

The paper presents the GPU Kernel Scientist, an automated, LLM-driven framework that iteratively optimizes GPU kernels for non-CUDA hardware (notably HIP kernels on the AMD MI300) using only end-to-end timing feedback. It introduces a three-stage cycle (LLM Evolutionary Selector, LLM Experiment Designer, and LLM Kernel Writer) whose stages respectively select seed code, plan experiments, and generate executable HIP kernels, which are then evaluated in a restricted competition environment. The paper shows how LLMs can bridge documentation gaps, compensate for limited profiling tools, and augment limited human expertise, achieving competitive performance despite these constraints. The work highlights the potential for democratizing high-performance kernel development, and it outlines practical limitations and avenues for future expansion to other hardware and richer feedback mechanisms.

Abstract

Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU architectures where traditional development aids are scarce. This paper introduces an LLM-powered "GPU Kernel Scientist," an automated methodology for iteratively refining accelerator kernels. Our methodology employs LLMs in a multi-stage, evolutionary process: (a) strategically selecting promising prior code versions as a basis for new iterations; (b) generating hypotheses for optimization experiments, based on existing code and assimilated knowledge from general GPU literature; and (c) autonomously implementing these experiments through code modification and subsequent submission to an external evaluation system, using only observed timing data as performance feedback. We detail how this approach navigates the challenges of the AMD MI300 target architecture and leverages LLMs to compensate for limited domain-specific human expertise. In addition to our results, we present the architectural design, operational workflow, and qualitative insights, highlighting the potential of LLM-driven agents to democratise and accelerate GPU kernel optimization, especially in resource-constrained or rapidly evolving hardware environments.
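The three-stage process in the abstract (select a prior version, design an experiment, write and submit the kernel, then record the timing) can be pictured as a simple loop. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `Candidate`, `run_cycle`, `select_fastest`, `design_stub`, `write_stub`, and `fake_evaluate` are all hypothetical, the three LLM stages are replaced by trivial deterministic stubs, and the external evaluation system is replaced by a made-up timing function.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    source: str              # kernel source text (HIP in the paper's setting)
    time_us: float           # end-to-end timing, the only feedback signal
    parent: Optional[int]    # index of the candidate this one was derived from
    note: str = ""           # description of the experiment that produced it


def run_cycle(
    population: List[Candidate],
    select: Callable[[List[Candidate]], int],
    design: Callable[[Candidate], str],
    write: Callable[[Candidate, str], str],
    evaluate: Callable[[str], float],
    iterations: int,
) -> Candidate:
    """Repeat select -> design -> write -> evaluate; return the fastest candidate."""
    for _ in range(iterations):
        parent_idx = select(population)               # stage (a): pick a basis version
        plan = design(population[parent_idx])         # stage (b): propose an experiment
        source = write(population[parent_idx], plan)  # stage (c): produce new kernel code
        time_us = evaluate(source)                    # submit and observe timing only
        population.append(Candidate(source, time_us, parent_idx, plan))
    return min(population, key=lambda c: c.time_us)


# --- Hypothetical stand-ins for the LLM stages and the external evaluator ---

def select_fastest(pop: List[Candidate]) -> int:
    # Assumption: greedy choice of the fastest candidate; an LLM selector
    # could weigh other factors.
    return min(range(len(pop)), key=lambda i: pop[i].time_us)


def design_stub(parent: Candidate) -> str:
    return f"experiment {parent.source.count('// exp') + 1}"


def write_stub(parent: Candidate, plan: str) -> str:
    # A real writer would modify the kernel; this stub only appends a marker.
    return parent.source + f"\n// exp: {plan}"


def fake_evaluate(source: str) -> float:
    # Made-up timing model: each marker makes the "kernel" look faster.
    return 100.0 / (1 + source.count("// exp"))


seed_src = "__global__ void k() {}"
population = [Candidate(seed_src, fake_evaluate(seed_src), None, "seed")]
best = run_cycle(population, select_fastest, design_stub, write_stub, fake_evaluate, 3)
print(f"{len(population)} candidates, best time {best.time_us:.1f} us")
```

Keeping every candidate with its parent index and experiment note preserves the lineage that a selection stage needs in order to choose among prior versions, which is the one structural point the sketch is meant to convey.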

Paper Structure

This paper contains 21 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: GPU Kernel Scientist Process