TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design

Jonghun Lee; Junghoon Lee; Hyeonjin Kim; Seoho Jeon; Jisup Yoon; Hyunbin Park; Meejeong Park; Heonjae Ha

TL;DR

TriGen tackles the bottleneck of end-to-end LLM inference on resource-constrained devices by marrying a mixed-precision MXINT8 data path with a LUT-based nonlinear processing pipeline and a resource-aware dataflow/scheduling strategy. The architecture integrates a lightweight RISC-V CP, four DLAs with a 32×32 MAC array, a TMU, and a 1 MiB SRAM, while software optimizations fuse operators and optimize tiling to minimize DRAM traffic. Key innovations include MXINT8 support for activations, a compact two-table LUT for nonlinear functions, and a co-design that jointly optimizes computation and data movement, achieving on average $2.73\times$ speedup and $52\%$ less memory transfer with negligible accuracy loss across multiple LLMs. This work demonstrates practical, scalable end-to-end on-device LLM acceleration, potentially enabling richer on-device AI capabilities under strict memory and bandwidth constraints.
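As a rough illustration of the microscaling idea behind MXINT8, the sketch below quantizes one block of values to 8-bit integers that share a single power-of-two scale. The block size of 32, the rounding rule, and the helper names `mx_quantize` and `mx_dequantize` are assumptions made for illustration; they are not taken from the paper, and TriGen's actual encoding may differ.

```python
import math

BLOCK = 32       # assumed number of elements sharing one scale
ELEM_MAX = 127   # largest magnitude stored in a signed 8-bit element

def mx_quantize(block):
    """Return (shared_exponent, int8_values) for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # Smallest power-of-two scale for which amax / scale fits in int8.
    exp = math.ceil(math.log2(amax / ELEM_MAX))
    scale = 2.0 ** exp
    q = [max(-ELEM_MAX, min(ELEM_MAX, round(x / scale))) for x in block]
    return exp, q

def mx_dequantize(exp, q):
    """Reconstruct approximate floats from the shared exponent and int8 values."""
    scale = 2.0 ** exp
    return [v * scale for v in q]

if __name__ == "__main__":
    values = [math.sin(i) * 3.0 for i in range(BLOCK)]
    exp, q = mx_quantize(values)
    restored = mx_dequantize(exp, q)
    worst = max(abs(a - b) for a, b in zip(values, restored))
    print("shared exponent:", exp, "max abs error:", worst)
```

Because the scale is shared per block, the storage cost is one exponent plus 8 bits per element, and the rounding error of every element is bounded by half the block's scale.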

Abstract

Recent studies have extensively explored NPU architectures for accelerating AI inference in on-device environments, which are inherently resource-constrained. Meanwhile, transformer-based large language models (LLMs) have become dominant, with rapidly increasing model sizes but a low degree of parameter reuse compared to conventional CNNs, making end-to-end execution on resource-limited devices extremely challenging. To address these challenges, we propose TriGen, a novel NPU architecture tailored for resource-constrained environments through software-hardware co-design. First, TriGen adopts low-precision computation using microscaling (MX) to enable additional optimization opportunities while preserving accuracy, and resolves the issues that arise from employing such precision. Second, to jointly optimize both nonlinear and linear operations, TriGen eliminates the need for specialized hardware for essential nonlinear operations by using a fast and accurate lookup table (LUT), thereby maximizing performance gains and reducing hardware cost in on-device environments. Finally, taking practical hardware constraints into account, TriGen employs scheduling techniques to maximize computational utilization even under limited on-chip memory capacity. We evaluate TriGen on various LLMs and show that it achieves an average 2.73x speedup and 52% less memory transfer over the baseline NPU design with negligible accuracy loss.
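One common way to evaluate a nonlinear function from a lookup table instead of dedicated hardware is piecewise-linear interpolation, where one table stores a slope and another stores an intercept for each input segment. The sketch below applies that generic scheme to exp(x) on an assumed input range of [-8, 0], as would arise in a softmax after subtracting the row maximum. The range, the 64-segment table size, and the names `build_tables` and `lut_eval` are hypothetical; the paper's LUT organization may be different.

```python
import math

LO, HI, SEGMENTS = -8.0, 0.0, 64   # assumed input range and table size
STEP = (HI - LO) / SEGMENTS

def build_tables(fn):
    """Build slope and intercept tables so fn(x) ~ slope[i] * x + intercept[i]."""
    slopes, intercepts = [], []
    for i in range(SEGMENTS):
        x0 = LO + i * STEP
        x1 = x0 + STEP
        m = (fn(x1) - fn(x0)) / STEP      # chord slope across the segment
        slopes.append(m)
        intercepts.append(fn(x0) - m * x0)
    return slopes, intercepts

def lut_eval(x, slopes, intercepts):
    """Approximate fn(x) with one table lookup, one multiply, and one add."""
    x = min(max(x, LO), HI)               # clamp to the tabulated range
    i = min(int((x - LO) / STEP), SEGMENTS - 1)
    return slopes[i] * x + intercepts[i]

if __name__ == "__main__":
    slopes, intercepts = build_tables(math.exp)
    for x in (-7.3, -2.0, -0.05):
        print(x, lut_eval(x, slopes, intercepts), math.exp(x))
```

With this layout the per-element cost is a single multiply-add, which maps onto the same arithmetic units used for linear layers. Accuracy is controlled by the segment count: halving the segment width cuts the worst-case interpolation error by roughly a factor of four.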

Paper Structure (23 sections, 9 equations, 14 figures, 6 tables, 2 algorithms)

Introduction
Background and Related Work
Challenges and Motivations
Architecture
Architecture Overview
Supported Data Types in TriGen
MAC Processing Array (MPA)
Post Processing Accelerator (PPA)
Lookup Table (LUT)
Support for Multi-NPUs
Software Optimizations
Operator Optimization
Dataflow and Tiling Strategy
Case Study: Deploying Llama on TriGen NPU
Normalize Layer
...and 8 more sections

Figures (14)

Figure 1: Structure of the LLaMA3.2 model
Figure 2: A tiling example of a matmul $A \times B = C$
Figure 3: Latency breakdown of LLMs for input sequence lengths from 512 to 4096. As the input sequence becomes longer, the portion of time spent in nonlinear functions increases quadratically. At an input sequence length of 4096, nonlinear operations account for 20.5% of end-to-end execution time on average.
Figure 4: Overview of TriGen architecture
Figure 5: FI32 data type of TriGen
...and 9 more figures
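
The matmul tiling named in the Figure 2 caption can be sketched in a few lines. The version below is a generic blocked matrix multiply in plain Python, not the paper's dataflow: the function name `tiled_matmul`, the loop order, and the default tile size are assumptions chosen only to show how a product $A \times B = C$ is split into tiles small enough to fit a limited on-chip buffer.

```python
def tiled_matmul(A, B, tile=2):
    """Compute C = A x B by iterating over tile-sized blocks of the operands."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):            # tile rows of A and C
        for j0 in range(0, N, tile):        # tile columns of B and C
            for k0 in range(0, K, tile):    # tiles along the reduction dimension
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        acc = C[i][j]       # partial sum carried across k-tiles
                        for k in range(k0, min(k0 + tile, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(tiled_matmul(A, B))
```

In a real accelerator each tile of A, B, and C would be staged in on-chip memory, so the tile size and loop order determine how often operands are re-fetched from DRAM.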
