LLM-Aided Compilation for Tensor Accelerators

Charles Hong, Sahil Bhatia, Altan Haan, Shengjun Kris Dong, Dima Nikiforov, Alvin Cheung, Yakun Sophia Shao

TL;DR

This work investigates using large language models (LLMs) to build an agile compiler flow for tensor accelerators. To keep pace with changing DSLs and hardware, it splits compilation into a functional translation phase and a cost-model-driven optimization phase. It demonstrates that GPT-4 can translate robotics kernels to the Gemmini ISA, and that decomposing translation into smaller, more structured steps improves success and performance. The methodology combines code template generation with LLM prompts and a feedback loop from hardware cost models to iteratively refine both correctness and performance. Through experiments on model-predictive control and Riccati recursion, the authors show promising results in translating, repairing, and optimizing code that targets tensor accelerators, suggesting a viable path toward automated hardware-software co-design. The practical impact is a more agile, scalable framework for developing and evaluating tensor accelerators in domains beyond deep learning, with potential benefits for time-to-market and design-space exploration.

Abstract

Hardware accelerators, in particular accelerators for tensor processing, have many potential application domains. However, they currently lack the software infrastructure to support the majority of domains outside of deep learning. Furthermore, a compiler that can easily be updated to reflect changes at both application and hardware levels would enable more agile development and design space exploration of accelerators, allowing hardware designers to realize closer-to-optimal performance. In this work, we discuss how large language models (LLMs) could be leveraged to build such a compiler. Specifically, we demonstrate the ability of GPT-4 to achieve high pass rates in translating code to the Gemmini accelerator, and prototype a technique for decomposing translation into smaller, more LLM-friendly steps. Additionally, we propose a 2-phase workflow for utilizing LLMs to generate hardware-optimized code.
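
The 2-phase workflow mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function `two_phase_compile`, its parameters, and the toy stand-ins for the LLM (`toy_propose`), the functional check (`toy_passes`), and the hardware cost model (`toy_cost`) are all hypothetical names introduced here for illustration. The sketch assumes only what the abstract states: a functional translation phase, followed by an optimization phase guided by cost feedback.

```python
from typing import Callable, Optional


def two_phase_compile(
    source: str,
    propose: Callable[[str, str], str],    # hypothetical LLM stand-in: (code, feedback) -> new code
    passes_tests: Callable[[str], bool],   # hypothetical functional-correctness check
    cost: Callable[[str], float],          # hypothetical hardware cost model (lower is better)
    max_repairs: int = 3,
    max_opt_rounds: int = 3,
) -> Optional[str]:
    # Phase 1: functional translation, with a bounded repair loop.
    candidate = propose(source, "translate")
    repairs = 0
    while not passes_tests(candidate):
        if repairs == max_repairs:
            return None  # no functionally correct translation found
        candidate = propose(candidate, "repair: functional tests failed")
        repairs += 1

    # Phase 2: optimization driven by cost-model feedback.
    best, best_cost = candidate, cost(candidate)
    for _ in range(max_opt_rounds):
        attempt = propose(best, f"optimize: current cost {best_cost}")
        # Keep a rewrite only if it is still correct and strictly cheaper.
        if passes_tests(attempt) and cost(attempt) < best_cost:
            best, best_cost = attempt, cost(attempt)
    return best


# Toy stand-ins (assumptions for this demo only; no real LLM or accelerator is involved).
def toy_propose(code: str, feedback: str) -> str:
    if feedback.startswith("translate"):
        return "buggy"
    if feedback.startswith("repair"):
        return "slow"
    return "fast"


def toy_passes(code: str) -> bool:
    return code != "buggy"


def toy_cost(code: str) -> float:
    return {"buggy": 99.0, "slow": 10.0, "fast": 4.0}[code]


result = two_phase_compile("source kernel", toy_propose, toy_passes, toy_cost)
```

Separating the two phases means the cost model is only ever queried on code that already passes the functional check, so an optimization step cannot trade correctness for speed in this sketch.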

Paper Structure (17 sections, 14 figures, 3 tables)

Figures (14)

  • Figure 1: An overview of our proposed framework.
  • Figure 2: Gemmini ISA specification from Section \ref{sec:experiments_gemmini}.
  • Figure 3: Code translation task description, as described in Sections \ref{sec:translation_proposed} and \ref{sec:experiments_gemmini}.
  • Figure 4: Task description for optimizing blocks of code.
  • Figure 5: Task description for reordering blocks of unoptimized code to generate optimized code.
  • ...and 9 more figures