NeCTAr: A Heterogeneous RISC-V SoC for Language Model Inference in Intel 16
Viansa Schmulbach, Jason Kim, Ethan Gao, Lucy Revina, Nikhil Jha, Ethan Wu, Borivoje Nikolic
TL;DR
The paper targets efficient edge inference of large language models by pairing near-memory dense accelerators with near-core sparse accelerators in a 16nm heterogeneous RISC-V SoC. NeCTAr integrates four in-order RISC-V cores, four near-memory compute engines, four CPU-coupled sparse accelerators, and a best-offset prefetcher, connected through a TileLink NoC and RoCC interfaces, and was taped out in an agile 15-week flow. Reported results include up to 132 GOPs/W efficiency, up to 6.02 GOPs of INT8 performance, and up to 100x speedups on large matmul workloads, along with up to 45.4 inferences/s/W and 1.28 inferences/s for 1.7M ReLU-Llama inference. These results support near-memory and near-core acceleration as an approach for evolving DL workloads. The work also provides a reusable design framework and methodology for rapidly prototyping domain-specific accelerators on modern edge chips.
Abstract
This paper introduces NeCTAr (Near-Cache Transformer Accelerator), a 16nm heterogeneous multicore RISC-V SoC for sparse and dense machine learning kernels with both near-core and near-memory accelerators. A prototype chip runs at 400MHz at 0.85V and performs matrix-vector multiplications with 109 GOPs/W. The effectiveness of the design is demonstrated by running inference on a sparse language model, ReLU-Llama.
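To illustrate why a ReLU-based model such as ReLU-Llama suits sparse acceleration, the following is a minimal software sketch, not the chip's actual datapath. It assumes that ReLU zeroes part of the activation vector, so a matrix-vector multiply can skip the weight columns paired with those zeros. The function names and the small example matrix are hypothetical and chosen only for illustration.

```python
def relu(values):
    """Clamp negative entries to zero, as a ReLU activation does."""
    return [v if v > 0 else 0 for v in values]


def dense_matvec(weights, x):
    """Reference matrix-vector product that touches every weight."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def sparse_matvec(weights, x):
    """Matrix-vector product that skips columns whose activation is zero.

    Returns the result vector and the number of multiply-accumulates
    actually performed.
    """
    # Collect the indices and values of the nonzero activations once.
    nonzero = [(j, xj) for j, xj in enumerate(x) if xj != 0]
    result = []
    macs = 0
    for row in weights:
        acc = 0
        for j, xj in nonzero:
            acc += row[j] * xj
            macs += 1
        result.append(acc)
    return result, macs


# Hypothetical 3x4 weight matrix and pre-activation vector.
weights = [[1, -2, 3, 4], [0, 5, -6, 7], [8, 9, 1, -1]]
x = relu([3, -1, 0, 2])  # two of the four activations become zero

y_dense = dense_matvec(weights, x)
y_sparse, macs = sparse_matvec(weights, x)
dense_macs = len(weights) * len(x)
```

In this toy case the sparse version produces the same output as the dense one while performing half the multiply-accumulates (6 instead of 12). The savings on real hardware depend on the actual activation sparsity and on how the accelerator fetches weights.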
