TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design
Jonghun Lee, Junghoon Lee, Hyeonjin Kim, Seoho Jeon, Jisup Yoon, Hyunbin Park, Meejeong Park, Heonjae Ha
TL;DR
TriGen tackles the bottleneck of end-to-end LLM inference on resource-constrained devices by combining a mixed-precision MXINT8 data path with a LUT-based nonlinear processing pipeline and a resource-aware dataflow/scheduling strategy. The architecture integrates a lightweight RISC-V CP, four DLAs with a 32×32 MAC array, a TMU, and a 1 MiB SRAM, while software optimizations fuse operators and tune tiling to minimize DRAM traffic. Key innovations include MXINT8 support for activations, a compact two-table LUT for nonlinear functions, and a co-design that jointly optimizes computation and data movement. Across multiple LLMs, TriGen achieves on average a $2.73\times$ speedup and $52\%$ less memory transfer than the baseline NPU design, with negligible accuracy loss. This work demonstrates practical, scalable end-to-end on-device LLM acceleration, potentially enabling richer on-device AI capabilities under strict memory and bandwidth constraints.
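To make the MXINT8 idea concrete, the following is a minimal sketch of block-wise microscaling quantization, not TriGen's actual data path. It assumes the common MX convention of 32 elements sharing one power-of-two scale, and it simplifies the element encoding to plain signed 8-bit integer codes. The function names `mx_quantize_block` and `mx_dequantize_block` are hypothetical.

```python
import math

BLOCK = 32       # assumed number of elements sharing one scale
INT8_MAX = 127   # symmetric signed 8-bit range used in this sketch


def mx_quantize_block(values):
    """Quantize one block to (shared_exponent, list of int8 codes)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale that keeps amax / scale within the int8 range.
    exp = math.ceil(math.log2(amax / INT8_MAX))
    scale = 2.0 ** exp
    # Round to nearest code and clamp to guard against floating-point edge cases.
    codes = [max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in values]
    return exp, codes


def mx_dequantize_block(exp, codes):
    """Reconstruct approximate real values from the shared exponent and codes."""
    scale = 2.0 ** exp
    return [c * scale for c in codes]


if __name__ == "__main__":
    block = [math.sin(i) * 3.0 for i in range(BLOCK)]
    exp, codes = mx_quantize_block(block)
    restored = mx_dequantize_block(exp, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("shared exponent:", exp, "max abs error:", worst)
```

Under these assumptions a block costs 32 one-byte codes plus one shared exponent byte, or 8.25 bits per element. The per-element rounding error is bounded by half the shared scale.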
Abstract
Recent studies have extensively explored NPU architectures for accelerating AI inference in on-device environments, which are inherently resource-constrained. Meanwhile, transformer-based large language models (LLMs) have become dominant, with rapidly increasing model sizes but a low degree of parameter reuse compared to conventional CNNs, making end-to-end execution on resource-limited devices extremely challenging. To address these challenges, we propose TriGen, a novel NPU architecture tailored for resource-constrained environments through software-hardware co-design. First, TriGen adopts low-precision computation using microscaling (MX) to enable additional optimization opportunities while preserving accuracy, and resolves the issues that arise from employing such precision. Second, to jointly optimize both nonlinear and linear operations, TriGen eliminates the need for specialized hardware for essential nonlinear operations by using a fast and accurate lookup table (LUT), thereby maximizing performance gains and reducing hardware cost in on-device environments. Third, taking practical hardware constraints into account, TriGen employs scheduling techniques to maximize computational utilization even under limited on-chip memory capacity. We evaluate TriGen on various LLMs and show that it achieves an average 2.73x speedup and 52% less memory transfer over the baseline NPU design with negligible accuracy loss.
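As an illustration of how a lookup table can stand in for dedicated nonlinear hardware, the sketch below approximates a function with a uniformly sampled table and linear interpolation. It is a generic single-table scheme, not the LUT design used in TriGen, whose table layout is not described here. The helpers `build_lut` and `lut_eval`, the input range, and the table size are assumptions chosen for the example. The exponential is used because it appears in softmax.

```python
import math


def build_lut(fn, lo, hi, entries):
    """Sample fn at `entries` evenly spaced points over [lo, hi]."""
    step = (hi - lo) / (entries - 1)
    table = [fn(lo + i * step) for i in range(entries)]
    return table, step


def lut_eval(table, step, lo, x):
    """Approximate fn(x) by interpolating linearly between adjacent entries."""
    hi = lo + step * (len(table) - 1)
    x = min(max(x, lo), hi)              # clamp out-of-range inputs
    pos = (x - lo) / step                # fractional table position
    i = min(int(pos), len(table) - 2)    # index of the left neighbor
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])


if __name__ == "__main__":
    # exp(x) over [-8, 0], the range seen after max subtraction in softmax.
    table, step = build_lut(math.exp, -8.0, 0.0, 257)
    for x in (-7.3, -2.5, -0.01):
        print(x, lut_eval(table, step, -8.0, x), math.exp(x))
```

The table size trades accuracy against storage: for linear interpolation the worst-case error shrinks roughly with the square of the step size, so doubling the number of entries cuts the error by about a factor of four.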
