NOVA: NoC-based Vector Unit for Mapping Attention Layers on a CNN Accelerator

Mohit Upadhyay, Rohan Juneja, Weng-Fai Wong, Li-Shiuan Peh

TL;DR

NOVA introduces a NoC-based vector unit that performs on-chip non-linear approximation for attention layers by broadcasting slope and bias values across a NoC, enabling efficient overlay on existing edge accelerators. By replacing per-PE LUT storage with a line NoC broadcast, NOVA achieves substantial area, power, and energy savings while maintaining 1-cycle latency for common breakpoints. The approach is demonstrated across REACT, TPU v3/v4, and NVDLA architectures, showing up to $37.8\times$ power savings and several-fold area benefits over traditional LUT-based approximators, with notable energy improvements for BERT-like workloads. The work highlights the practical potential of using NoC wires for non-linear function approximation in attention-heavy models at the edge, enabling transformer-style workloads on compact accelerators with minimal hardware overhead.

Abstract

Attention mechanisms are increasingly popular and are used in neural network models across multiple domains, such as natural language processing (NLP) and vision applications, especially at the edge. However, attention layers are difficult to map onto existing neuro accelerators because they have a much higher density of non-linear operations, which leads to inefficient utilization of today's vector units. This work introduces NOVA, a NoC-based Vector Unit that performs non-linear operations within the NoC of the accelerator and can be overlaid onto existing neuro accelerators to map attention layers at the edge. Our results show that the NOVA architecture is up to 37.8x more power-efficient than state-of-the-art hardware approximators when running existing attention-based neural networks.
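To make the slope-and-bias approximation mentioned above concrete, here is a minimal software sketch of piecewise-linear approximation: a comparator step picks a segment, and a multiply-accumulate step evaluates `slope * x + bias`. It does not model the NoC or the hardware. The uniform breakpoints, the choice of GELU as the target function, the input range, and the names `build_segments` and `approximate` are all assumptions made for illustration and are not taken from the paper.

```python
import math
from bisect import bisect_right

def gelu(x):
    """tanh-based GELU, used here only as an example non-linear function."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def build_segments(fn, lo, hi, n_segments=8):
    """Return interior breakpoints and one (slope, bias) pair per segment.

    Assumption: uniform segments, each line joining the function values
    at the segment's two endpoints.
    """
    step = (hi - lo) / n_segments
    breakpoints = [lo + i * step for i in range(1, n_segments)]
    pairs = []
    for i in range(n_segments):
        x0 = lo + i * step
        x1 = lo + (i + 1) * step
        slope = (fn(x1) - fn(x0)) / (x1 - x0)
        bias = fn(x0) - slope * x0
        pairs.append((slope, bias))
    return breakpoints, pairs

def approximate(x, breakpoints, pairs):
    """Comparator step selects the segment; MAC step evaluates the line."""
    idx = bisect_right(breakpoints, x)   # which segment x falls into
    slope, bias = pairs[idx]
    return slope * x + bias              # single multiply-accumulate

if __name__ == "__main__":
    bps, pairs = build_segments(gelu, -4.0, 4.0)
    for x in (-2.5, -0.3, 0.7, 3.1):
        print(f"x={x:+.1f}  exact={gelu(x):+.4f}  approx={approximate(x, bps, pairs):+.4f}")
```

Eight segments are used so that the table matches the eight slope/bias pairs that the Figure 3 caption says fit on one link. A real design would likely place breakpoints non-uniformly to reduce error where the function curves most.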

Paper Structure (27 sections, 8 figures, 4 tables)

Figures (8)

  • Figure 1: LUT-based approximator (shared by 256 neurons)
  • Figure 2: Walkthrough of approximation with LUT-based baseline
  • Figure 3: Architecture of NOVA router with the comparator and MAC. Each router has two input and output links, connected in a 1D line topology. Each input and output link is 257 bits wide, encompassing 16 words (8 pairs of slope and bias values) along with their corresponding tag bit.
  • Figure 4: Walkthrough of approximation using NOVA NoC
  • Figure 5: Integrating NOVA with (a) REACT WS routers, (b) TPU v3/v4 MXU, (c) NVDLA Convolution Core
  • ...and 3 more figures
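
The link width quoted in the Figure 3 caption can be checked with a small packing sketch. If the 257 bits hold 16 equal-width words plus one tag bit, each word is (257 - 1) / 16 = 16 bits wide. That word width is an inference from the caption, and the bit ordering below (tag in the top bit, pairs packed slope-first) and the names `pack_flit` and `unpack_flit` are hypothetical choices for illustration only.

```python
WORD_BITS = 16        # assumption: inferred from (257 - 1) / 16 words
WORDS_PER_FLIT = 16   # 8 (slope, bias) pairs, per the Figure 3 caption
FLIT_BITS = WORD_BITS * WORDS_PER_FLIT + 1   # payload words plus one tag bit
WORD_MASK = (1 << WORD_BITS) - 1

def pack_flit(pairs, tag):
    """Pack 8 (slope, bias) word pairs and a 1-bit tag into one integer.

    Hypothetical layout: tag in the most significant bit, then slope0,
    bias0, slope1, bias1, ... down to the least significant word.
    """
    if len(pairs) * 2 != WORDS_PER_FLIT:
        raise ValueError("expected exactly 8 (slope, bias) pairs")
    flit = tag & 1
    for slope, bias in pairs:
        for word in (slope, bias):
            flit = (flit << WORD_BITS) | (word & WORD_MASK)
    return flit

def unpack_flit(flit):
    """Inverse of pack_flit: recover the list of pairs and the tag bit."""
    words = [(flit >> (WORD_BITS * i)) & WORD_MASK
             for i in reversed(range(WORDS_PER_FLIT))]
    tag = flit >> (WORD_BITS * WORDS_PER_FLIT)
    pairs = list(zip(words[0::2], words[1::2]))
    return pairs, tag

if __name__ == "__main__":
    example = [(i, 100 + i) for i in range(8)]   # raw 16-bit word values
    flit = pack_flit(example, tag=1)
    print(FLIT_BITS, unpack_flit(flit) == (example, 1))
```

The words here are treated as raw 16-bit patterns; the numeric format they encode (fixed point or a 16-bit float) is not stated in this excerpt.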