TPU-Gen: LLM-Driven Custom Tensor Processing Unit Generator

Deepak Vungarala; Mohammed E. Elbtity; Sumiya Syed; Sakila Alam; Kartik Pandit; Arnob Ghosh; Ramtin Zand; Shaahin Angizi

TL;DR

TPU-Gen tackles the challenge of automated TPU design by combining domain-specific LLMs with a curated dataset and a Retrieval-Augmented Generation (RAG) pipeline to reduce hallucinations. It introduces a parameterized architectural template for a systolic-array TPU with Output-Stationary dataflow and a highly parameterized RTL library to support varying systolic size $S$, bit width $n$, and weight width $W$. The curated dataset includes 29,952 architectural variations and 25,000 datapoints across 8 implementations, enabling reuse and adaptation for diverse DNN workloads. Experimental results show that TPU-Gen reduces area and power by 92% and 96% on average, respectively, relative to the manual optimization reference values, with RAG playing a central role in keeping the generated designs correct. The work suggests a practical route to open, scalable, LLM-assisted hardware design, and the authors plan to release the datasets and fine-tuned models publicly.
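To make the Output-Stationary dataflow concrete, the following is a minimal cycle-level sketch in Python, not the paper's RTL. It assumes a square $S \times S$ array, unsigned operands that fit in `n_bits`, and the usual skewed operand feed; the function name `os_systolic_matmul` and its parameters are hypothetical and chosen only for illustration.

```python
def os_systolic_matmul(A, B, n_bits=8):
    """Toy model of an S x S output-stationary systolic array.

    Assumption (not from the paper): operands are unsigned n_bits-wide
    integers and inputs are skewed so that PE(i, j) sees A[i][k] and
    B[k][j] at cycle t = i + j + k.
    """
    S = len(A)
    limit = 1 << n_bits
    assert len(B) == S and all(len(r) == S for r in A + B), "square S x S inputs"
    assert all(0 <= x < limit for M in (A, B) for r in M for x in r), "operand overflow"

    # Each processing element keeps its own partial sum in place;
    # only the operands move through the array.
    acc = [[0] * S for _ in range(S)]

    # PE(S-1, S-1) consumes its last operand pair at cycle 3S - 3.
    cycles = 3 * S - 2
    for t in range(cycles):
        for i in range(S):
            for j in range(S):
                k = t - i - j  # which operand pair reaches PE(i, j) now
                if 0 <= k < S:
                    acc[i][j] += A[i][k] * B[k][j]
    return acc, cycles


C, cycles = os_systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, cycles)
```

Because every accumulator stays in its processing element until the product is complete, only the two operand streams move between neighbours, which is the defining property of this dataflow.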

Abstract

The increasing complexity and scale of Deep Neural Networks (DNNs) necessitate specialized tensor accelerators, such as Tensor Processing Units (TPUs), to meet various computational and energy efficiency requirements. Nevertheless, designing an optimal TPU remains challenging due to the high level of domain expertise required, considerable manual design time, and the lack of high-quality, domain-specific datasets. This paper introduces TPU-Gen, the first Large Language Model (LLM) based framework designed to automate the exact and approximate TPU generation process, focusing on systolic array architectures. TPU-Gen is supported by a meticulously curated, comprehensive, and open-source dataset that covers a wide range of spatial array designs and approximate multiply-and-accumulate units, enabling design reuse, adaptation, and customization for different DNN workloads. The proposed framework leverages Retrieval-Augmented Generation (RAG) as an effective way to build LLMs for the data-scarce hardware domain, addressing its most prominent issue, hallucination. TPU-Gen transforms high-level architectural specifications into optimized low-level implementations through an effective hardware generation pipeline. Our extensive experimental evaluations demonstrate superior performance, power, and area efficiency, with an average reduction in area and power of 92% and 96% from the manual optimization reference values. These results set new standards for driving advancements in next-generation design automation tools powered by LLMs.
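As a rough illustration of the retrieval step in a RAG pipeline, the sketch below ranks stored design descriptions against a specification and prepends the best match to the prompt. It is an assumption-laden toy, not TPU-Gen's retriever: the corpus entries, the bag-of-words cosine similarity, and the names `retrieve` and `build_prompt` are all hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (a deliberately simple stand-in
    for a learned embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=1):
    """Return the k corpus entries whose descriptions best match the query."""
    q = bag_of_words(query)
    ranked = sorted(
        corpus,
        key=lambda entry: cosine(q, bag_of_words(entry["description"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, corpus, k=1):
    """Ground the generation request in retrieved reference designs."""
    context = "\n".join(
        f"- {e['name']}: {e['description']}" for e in retrieve(query, corpus, k)
    )
    return f"Reference designs:\n{context}\n\nSpecification:\n{query}\n"


# Hypothetical corpus entries, invented for this example only.
corpus = [
    {"name": "sa_16x16_exact", "description": "16x16 systolic array exact multiplier 8 bit"},
    {"name": "sa_8x8_approx", "description": "8x8 systolic array approximate multiplier 4 bit"},
]

print(build_prompt("approximate 8x8 systolic array with 4 bit inputs", corpus))
```

Supplying a retrieved, known-good design alongside the specification gives the model concrete material to adapt rather than invent, which is the general mechanism by which retrieval is expected to curb hallucination.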
