NeCTAr: A Heterogeneous RISC-V SoC for Language Model Inference in Intel 16

Viansa Schmulbach, Jason Kim, Ethan Gao, Lucy Revina, Nikhil Jha, Ethan Wu, Borivoje Nikolic

TL;DR

The paper targets efficient edge inference of large language models by combining near-memory dense accelerators with near-core sparse accelerators in a 16nm heterogeneous RISC-V SoC. NeCTAr integrates four in-order RISC-V cores, four near-memory compute engines, four CPU-coupled sparse accelerators, and a best-offset prefetcher, connected through a TileLink NoC and RoCC interfaces, and was taped out in an agile 15-week flow. Reported results include up to 132 GOPs/W efficiency, up to 6.02 GOPs peak INT8 performance, and up to 100x speedups on large matmul workloads, along with up to 45.4 inferences/s/W and 1.28 inferences/s for 1.7M ReLU-Llama inference. These results support near-memory and near-core acceleration for evolving DL workloads. The work also provides a reusable design framework and methodology for rapidly prototyping domain-specific accelerators on modern edge chips, which may inform future ML inference platforms.

Abstract

This paper introduces NeCTAr (Near-Cache Transformer Accelerator), a 16nm heterogeneous multicore RISC-V SoC for sparse and dense machine learning kernels with both near-core and near-memory accelerators. A prototype chip runs at 400MHz at 0.85V and performs matrix-vector multiplications with 109 GOPs/W. The effectiveness of the design is demonstrated by running inference on a sparse language model, ReLU-Llama.
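
To illustrate why a ReLU-based model such as ReLU-Llama suits sparse matrix-vector hardware, here is a minimal software sketch. It is not the chip's implementation: the function names, the tiny integer matrix, and the zero-skipping loop are all assumptions made for illustration. ReLU sets negative activations to zero, so a matrix-vector product can skip every multiply-accumulate whose activation is zero and still produce the same result.

```python
# Illustrative sketch only: NOT the NeCTAr hardware or its software stack.
# All names and values here are hypothetical.

def relu(xs):
    """Clamp negative activations to zero, which creates sparsity."""
    return [x if x > 0 else 0 for x in xs]

def dense_matvec(weights, x):
    """Reference matrix-vector product that touches every element."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def sparse_matvec(weights, x):
    """Matrix-vector product that skips zero activations.

    Returns the result vector and the number of multiply-accumulates done.
    """
    nonzero = [(j, xj) for j, xj in enumerate(x) if xj != 0]
    out = []
    macs = 0
    for row in weights:
        acc = 0
        for j, xj in nonzero:
            acc += row[j] * xj
            macs += 1
        out.append(acc)
    return out, macs

# Small integer values stand in for INT8 weights and activations.
weights = [[1, -2, 3, 4],
           [0, 5, -6, 7],
           [8, 9, 10, -11]]
activations = relu([3, -1, 0, 2])

dense = dense_matvec(weights, activations)
sparse, macs = sparse_matvec(weights, activations)
dense_macs = len(weights) * len(activations)
print(dense, sparse, macs, dense_macs)
```

In this toy case half of the activations are zero, so the sparse path performs half of the dense multiply-accumulates. The savings on a real model depend on its actual activation sparsity.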

Paper Structure

This paper contains 11 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: NeCTAr block diagram.
  • Figure 2: Chip specifications and die.
  • Figure 3: Clock tree diagram.
  • Figure 4: Near-memory compute engine architecture.
  • Figure 5: NMCE programming model for 256x4 (left) and 128x4 (right).
  • ...and 5 more figures