
AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Genghan Zhang, Shaowei Zhu, Anjiang Wei, Zhenyu Song, Allen Nie, Zhen Jia, Nandita Vijaykumar, Yida Wang, Kunle Olukotun

TL;DR

AccelOpt is a self-improving LLM agentic system that autonomously optimizes AI accelerator kernels for AWS Trainium by coupling beam-search exploration with an optimization memory that accumulates experiences from past slow–fast kernel pairs. The approach uses a planner/executor/summarizer trio to generate, validate, and generalize optimizations, with a distributed profiling service and a roofline-based NKIBench benchmark to quantify progress relative to hardware peak. Results show the average percentage of peak throughput rising from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2, while open-source models match the kernel improvements of Claude Sonnet 4 at roughly $26\times$ lower cost. NKIBench provides a diverse kernel suite derived from real workloads and a peak-performance metric, enabling an assessment of optimization progress that goes beyond relative speedups. Overall, AccelOpt demonstrates the viability of automated, self-improving kernel optimization for emerging accelerators and highlights the practical benefit of memory-augmented search when hardware-specific expertise is unavailable.
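
The loop described above can be pictured with a minimal sketch. It assumes the planner, executor, and summarizer are plain callables and that a lower measured latency is better. Every name here (`Kernel`, `Experience`, `optimize`, `beam_width`, `memory_size`) is hypothetical, and the selection and memory-curation rules are simplifications rather than the paper's actual policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Kernel:
    source: str
    latency: float  # measured latency; lower is better (assumption)


@dataclass(frozen=True)
class Experience:
    slow: Kernel
    fast: Kernel
    insight: str  # summarizer's generalized takeaway


def optimize(
    seed: Kernel,
    plan: Callable[[Kernel, List[Experience], int], List[str]],
    execute: Callable[[Kernel, str], Optional[Kernel]],
    summarize: Callable[[Kernel, Kernel], str],
    iterations: int = 3,
    beam_width: int = 2,
    plans_per_kernel: int = 2,
    memory_size: int = 4,
) -> Tuple[Kernel, List[Experience]]:
    """Beam search over kernels, guided by a bounded optimization memory."""
    beam: List[Kernel] = [seed]
    memory: List[Experience] = []
    for _ in range(iterations):
        children: List[Kernel] = []
        for kernel in beam:
            # Planner proposes optimization ideas using the current memory.
            for idea in plan(kernel, memory, plans_per_kernel):
                # Executor rewrites and profiles; None means the attempt failed.
                child = execute(kernel, idea)
                if child is None:
                    continue
                children.append(child)
                # Summarizer records a slow-fast pair only when there is a gain.
                if child.latency < kernel.latency:
                    memory.append(Experience(kernel, child, summarize(kernel, child)))
        # Keep only the most recent experiences (assumed curation rule).
        memory = memory[-memory_size:]
        # Parents compete with children for the next beam (assumed rule).
        beam = sorted(beam + children, key=lambda k: k.latency)[:beam_width]
    return beam[0], memory
```

In a real deployment the three callables would wrap LLM calls and a profiling service; here they are left abstract so the control flow stays visible.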

Abstract

We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.
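
To make "percentage of peak throughput" concrete, the sketch below computes it under the classic roofline model, where attainable throughput is the smaller of peak compute and arithmetic intensity times peak memory bandwidth. This is an assumption about how such a metric can be defined, not NKIBench's exact formula; the function and parameter names are hypothetical.

```python
def roofline_bound(flops: float, bytes_moved: float,
                   peak_flops_per_s: float, peak_bytes_per_s: float) -> float:
    """Attainable throughput in FLOP/s under the classic roofline model."""
    intensity = flops / bytes_moved  # FLOPs per byte of memory traffic
    return min(peak_flops_per_s, intensity * peak_bytes_per_s)


def percent_of_peak(flops: float, bytes_moved: float, latency_s: float,
                    peak_flops_per_s: float, peak_bytes_per_s: float) -> float:
    """Achieved throughput as a percentage of the roofline bound."""
    achieved = flops / latency_s
    bound = roofline_bound(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s)
    return 100.0 * achieved / bound
```

Unlike a relative speedup over a baseline, a metric of this shape indicates how much headroom remains before a kernel hits the hardware limit.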

Paper Structure

This paper contains 28 sections, 3 equations, 29 figures, 6 tables, 2 algorithms.

Figures (29)

  • Figure 1: At each iteration of AccelOpt, the agentic workflow shown on the right optimizes the candidate kernels with the latest optimization memory and generates new candidate kernels, updating the optimization memory with newly collected experiences. The algorithm section explains the overall workflow and each component in detail.
  • Figure 2: Prompt template for each agent in the agentic workflow.
  • Figure 3: A snapshot of AccelOpt's execution trace. In the experience item, the pseudocode of the slow-fast pair resembles the candidate and optimized kernels above, where affine_range is an NKI construct for parallel loops without loop-carried dependencies. The experience item is stored in the optimization memory, and the optimized kernel becomes a candidate for the next iteration.
  • Figure 4: NKIBench architecture. Kernels are grouped by the configuration of ML operators. The meshes represent cores of one Trainium chip; trn1.32xlarge and trn2.48xlarge are Amazon EC2 instances for Trainium 1 and 2, respectively.
  • Figure 5: Per-task kernel improvement achieved using Claude Sonnet 4 and AccelOpt on Trainium 1.
  • ...and 24 more figures