Autocomp: A Powerful and Portable Code Optimizer for Tensor Accelerators
Charles Hong, Sahil Bhatia, Alvin Cheung, Yakun Sophia Shao
TL;DR
Autocomp tackles the challenge of optimizing code for tensor accelerators, whose programming interfaces are low-resource languages for LLMs. It introduces an LLM-driven, portable two-phase optimization flow that first plans an optimization and then implements it, guided by correctness and performance feedback from hardware. A beam-search framework with diversity strategies and a schedule-reuse capability explores multiple optimization trajectories across Gemmini, Trainium, and NVIDIA L40S backends, achieving substantial speedups over vendor libraries, expert hand-tuned code, and a machine learning-based cost model. The approach is portable across platforms, since retargeting is done through prompts rather than backend code, and the generated schedules can generalize to similar tensor operations to reduce search cost. Open-source tooling and prompts allow accelerator developers to adapt the approach to new hardware. Overall, Autocomp delivers performance gains across diverse workloads and architectures, highlighting the viability of LLM-guided, prompt-driven optimization in specialized hardware contexts.
Abstract
Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today's computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating code in low-resource languages, such as specialized tensor accelerator code, still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three distinct hardware platforms, we demonstrate that Autocomp-optimized code runs 5.6x faster than the vendor-provided library (Gemmini), outperforms expert-level hand-tuned code by 1.9x (AWS Trainium), and achieves 3.8x higher performance than a machine learning-based cost model for GPUs (NVIDIA L40S). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.
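To make the search structure concrete, the following is a minimal sketch of a plan-then-implement beam search with hardware-style feedback. It is an illustration under stated assumptions, not Autocomp's implementation: the menu entries, the names `plan`, `implement`, `evaluate`, and `search`, and all parameter values are hypothetical, and the two LLM calls and the hardware measurement are replaced by random stand-ins. Keeping only candidates that are correct and faster than their parent, and carrying parents forward in the beam, are likewise assumptions of this sketch.

```python
import random
from dataclasses import dataclass, field

# Hypothetical optimization menu; the entries are illustrative placeholders.
MENU = ["tile loops", "hoist loads", "double-buffer", "fuse ops"]


@dataclass
class Candidate:
    code: str
    latency: float
    history: list = field(default_factory=list)  # plans applied so far


def plan(parent, rng):
    """Phase 1 (stand-in for an LLM call): choose an optimization from the menu."""
    return rng.choice(MENU)


def implement(parent, step):
    """Phase 2 (stand-in for an LLM call): rewrite the code according to the plan."""
    return parent.code + f"\n# applied: {step}"


def evaluate(parent, rng):
    """Stand-in for hardware feedback: returns (is_correct, measured_latency)."""
    is_correct = rng.random() > 0.2
    latency = parent.latency * rng.uniform(0.7, 1.2)
    return is_correct, latency


def search(start_code, iterations=3, beam_width=2, samples=4, seed=0):
    rng = random.Random(seed)
    beam = [Candidate(start_code, latency=1.0)]
    for _ in range(iterations):
        pool = list(beam)  # assumption: parents stay eligible for the next beam
        for parent in beam:
            for _ in range(samples):
                step = plan(parent, rng)
                code = implement(parent, step)
                ok, latency = evaluate(parent, rng)
                # Assumption: keep only correct candidates that beat their parent.
                if ok and latency < parent.latency:
                    pool.append(Candidate(code, latency, parent.history + [step]))
        # Retain the fastest candidates as the beam for the next iteration.
        beam = sorted(pool, key=lambda c: c.latency)[:beam_width]
    return beam[0]


best = search("kernel")
print(best.latency, best.history)
```

The `history` list recorded for the best candidate is the sequence of plans that produced it. This is the kind of information that schedule reuse would carry over to a similar tensor operation, although how Autocomp stores and replays schedules is not specified in this section.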
