AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
Genghan Zhang, Shaowei Zhu, Anjiang Wei, Zhenyu Song, Allen Nie, Zhen Jia, Nandita Vijaykumar, Yida Wang, Kunle Olukotun
TL;DR
AccelOpt is a self-improving LLM agentic system that autonomously optimizes AI accelerator kernels for AWS Trainium by coupling beam-search exploration with an optimization memory that accumulates experiences from past slow-fast kernel pairs. The approach uses a planner, an executor, and a summarizer to generate, validate, and generalize optimizations, supported by a distributed profiling service and a roofline-based benchmark, NKIBench, that quantifies progress relative to hardware peak. On NKIBench kernels, the average percentage of peak throughput rises from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2, and open-source models match the kernel improvements of Claude Sonnet 4 at roughly $26\times$ lower cost. NKIBench provides a diverse kernel suite derived from real LLM workloads together with a peak-performance metric, so optimization progress can be assessed against the hardware limit rather than only as a relative speedup. Overall, AccelOpt demonstrates that automated, self-improving kernel optimization is viable for emerging accelerators, and that memory-augmented search reduces the need for expert-provided, hardware-specific optimization knowledge.
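To make the "percentage of peak" idea concrete, here is a minimal sketch of how a roofline-style metric could be computed. It assumes a simple two-term roofline (compute bound versus memory bound); the function names and the hardware numbers are hypothetical placeholders, not Trainium specifications and not necessarily the exact formula NKIBench uses.

```python
def roofline_min_time_s(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Lower bound on runtime under a simple roofline model (assumed here)."""
    compute_time = flops / peak_flops_per_s        # time if purely compute bound
    memory_time = bytes_moved / peak_bytes_per_s   # time if purely bandwidth bound
    return max(compute_time, memory_time)


def percent_of_peak(measured_time_s, flops, bytes_moved,
                    peak_flops_per_s, peak_bytes_per_s):
    """Percentage of the roofline bound that a measured kernel run achieves."""
    bound = roofline_min_time_s(flops, bytes_moved,
                                peak_flops_per_s, peak_bytes_per_s)
    return 100.0 * bound / measured_time_s


# Hypothetical kernel and hardware numbers, chosen only for illustration.
score = percent_of_peak(
    measured_time_s=4e-3,
    flops=2e9,
    bytes_moved=4e6,
    peak_flops_per_s=1e12,
    peak_bytes_per_s=1e11,
)
print(f"{score:.1f}% of peak")
```

Unlike a relative speedup, a score of this kind shows how much headroom remains: a kernel at 50% of peak can at best be made twice as fast under the assumed model.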
Abstract
We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.
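The iterative, memory-informed search described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: kernels are modeled as tuples of applied transformation names, `toy_propose` stands in for the LLM planner and executor, `toy_measure` stands in for the profiling service, and `OptimizationMemory` is a hypothetical data structure rather than AccelOpt's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Kernel = Tuple[str, ...]  # toy stand-in: names of the transformations applied


@dataclass
class OptimizationMemory:
    """Keeps the slow-fast kernel pairs with the largest speedups (hypothetical)."""
    capacity: int = 8
    entries: List[Tuple[Kernel, Kernel, float]] = field(default_factory=list)

    def add(self, slow: Kernel, fast: Kernel, speedup: float) -> None:
        if any(e[0] == slow and e[1] == fast for e in self.entries):
            return  # this pair is already recorded
        self.entries.append((slow, fast, speedup))
        self.entries.sort(key=lambda e: e[2], reverse=True)
        del self.entries[self.capacity:]  # evict the weakest experiences


def beam_search(
    baseline: Kernel,
    propose: Callable[[Kernel, OptimizationMemory], List[Kernel]],
    measure: Callable[[Kernel], float],
    beam_width: int = 2,
    iterations: int = 3,
    memory: Optional[OptimizationMemory] = None,
) -> Tuple[Kernel, OptimizationMemory]:
    memory = memory if memory is not None else OptimizationMemory()
    latency: Dict[Kernel, float] = {baseline: measure(baseline)}
    beam = [baseline]
    for _ in range(iterations):
        children: List[Kernel] = []
        for parent in beam:
            for child in propose(parent, memory):
                if child not in latency:
                    latency[child] = measure(child)  # profile each kernel once
                if latency[child] < latency[parent]:
                    # Record the slow-fast pair so later proposals can use it.
                    memory.add(parent, child, latency[parent] / latency[child])
                children.append(child)
        pool = list(dict.fromkeys(beam + children))  # dedupe, keep order
        beam = sorted(pool, key=lambda k: latency[k])[:beam_width]
    return beam[0], memory


# Toy stand-ins: each transformation scales latency by a fixed factor.
FACTORS = {"unroll": 1.1, "fuse": 0.8, "tile": 0.5}


def toy_measure(kernel: Kernel) -> float:
    result = 100.0
    for name in kernel:
        result *= FACTORS[name]
    return result


def toy_propose(parent: Kernel, memory: OptimizationMemory) -> List[Kernel]:
    remaining = [t for t in FACTORS if t not in parent]
    # Transformations that produced a speedup in a remembered pair go first.
    helped = {t for slow, fast, _ in memory.entries for t in fast if t not in slow}
    remaining.sort(key=lambda t: t not in helped)
    return [tuple(sorted(parent + (t,))) for t in remaining[:2]]


best, memory = beam_search((), toy_propose, toy_measure)
print(best, toy_measure(best), len(memory.entries))
```

In this toy run the search settles on the kernel with `fuse` and `tile` applied, at a latency of about 40 versus the baseline 100. In a real system, the proposal step would presumably pass the remembered pairs to an LLM as context instead of reordering a fixed list of transformations.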
