KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

Kris Shengjun Dong; Sahil Modi; Dima Nikiforov; Sana Damani; Edward Lin; Siva Kumar Sastry Hari; Christos Kozyrakis

TL;DR

KernelBlaster is a Memory-Augmented In-context Reinforcement Learning (MAIC-RL) framework that improves the CUDA optimization search capabilities of LLM-based GPU coding agents. It accumulates knowledge in a retrievable Persistent CUDA Knowledge Base, so agents learn from experience and make systematically informed decisions on future tasks.

Abstract

Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires extensive exploration of an increasingly complex, hardware-specific optimization space. Traditional compilers are constrained by fixed heuristics, whereas finetuning Large Language Models (LLMs) can be expensive. Agentic workflows for CUDA code optimization, however, have limited ability to aggregate knowledge from prior exploration, leading to biased sampling and suboptimal solutions. We propose KernelBlaster, a Memory-Augmented In-context Reinforcement Learning (MAIC-RL) framework designed to improve the CUDA optimization search capabilities of LLM-based GPU coding agents. KernelBlaster enables agents to learn from experience and make systematically informed decisions on future tasks by accumulating knowledge into a retrievable Persistent CUDA Knowledge Base. We propose a novel profile-guided, textual-gradient-based agentic flow for CUDA generation and optimization to achieve high performance across generations of GPU architectures. KernelBlaster guides LLM agents to systematically explore high-potential optimization strategies beyond naive rewrites. Compared to the PyTorch baseline, our method achieves geometric mean speedups of 1.43x, 2.50x, and 1.50x on KernelBench Levels 1, 2, and 3, respectively. We release KernelBlaster as an open-source agentic framework, accompanied by a test harness, verification components, and a reproducible evaluation pipeline.
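
The abstract reports its results as geometric mean speedups over the PyTorch baseline. As a reminder of how such an aggregate is commonly computed, the following is a minimal sketch. The function name and the runtimes are illustrative assumptions and are not taken from the KernelBlaster evaluation pipeline.

```python
import math


def geometric_mean_speedup(baseline_ms, optimized_ms):
    """Geometric mean of per-task speedups (baseline time / optimized time).

    Hypothetical helper for illustration; not part of KernelBlaster.
    """
    if len(baseline_ms) != len(optimized_ms) or not baseline_ms:
        raise ValueError("need two non-empty lists of equal length")
    log_sum = 0.0
    for base, opt in zip(baseline_ms, optimized_ms):
        if base <= 0 or opt <= 0:
            raise ValueError("runtimes must be positive")
        # Averaging in log space is equivalent to taking the n-th root
        # of the product of the per-task speedups.
        log_sum += math.log(base / opt)
    return math.exp(log_sum / len(baseline_ms))


# Made-up runtimes in milliseconds for three hypothetical tasks.
baseline = [4.0, 9.0, 2.0]
optimized = [2.0, 3.0, 4.0]
print(round(geometric_mean_speedup(baseline, optimized), 3))
```

The geometric mean is the usual choice for aggregating ratios such as speedups, because a 2x speedup on one task and a 0.5x slowdown on another cancel to 1.0x rather than averaging to 1.25x.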

Paper Structure (35 sections, 1 equation, 20 figures, 3 tables, 2 algorithms)

Introduction
Related Work
Training-Based Solutions:
Static Prompt Engineering Solutions:
Search-Based Solutions:
Iterative Refinement on Prompting Policy:
Memory-Augmented Solutions:
Evolutionary Algorithms.
Semantic Learning via Textual Gradient Descent
In-Context Reinforcement Learning (ICRL):
Methodology
Evaluation
Evaluation Setup
Evaluation Metrics
Execution Harness
...and 20 more sections

Figures (20)

Figure 1: High-level block diagram of the KernelBlaster agentic workflow.
Figure 1: REINFORCE
Figure 2: Taxonomy of Agentic Flows for LLM-Driven Code Optimization.
Figure 3: Conceptual Model of Memory-Augmented In-context Reinforcement Learning (MAIC-RL) Across Tasks and Time.
Figure 4: Knowledge Base Construction.
...and 15 more figures
