NOVA: NoC-based Vector Unit for Mapping Attention Layers on a CNN Accelerator
Mohit Upadhyay, Rohan Juneja, Weng-Fai Wong, Li-Shiuan Peh
TL;DR
NOVA introduces a NoC-based vector unit that performs on-chip non-linear approximation for attention layers by broadcasting slope and bias values across a NoC, enabling efficient overlay on existing edge accelerators. By replacing per-PE LUT storage with a line NoC broadcast, NOVA achieves substantial area, power, and energy savings while maintaining 1-cycle latency for common breakpoints. The approach is demonstrated across REACT, TPU v3/v4, and NVDLA architectures, showing up to $37.8\times$ power savings and several-fold area benefits over traditional LUT-based approximators, with notable energy improvements for BERT-like workloads. The work highlights the practical potential of using NoC wires for non-linear function approximation in attention-heavy models at the edge, enabling transformer-style workloads on compact accelerators with minimal hardware overhead.
Abstract
Attention mechanisms are increasingly popular in neural network models across multiple domains, such as natural language processing (NLP) and vision applications, especially at the edge. However, attention layers are difficult to map onto existing neuro accelerators because they have a much higher density of non-linear operations, which leads to inefficient utilization of today's vector units. This work introduces NOVA, a NoC-based Vector Unit that performs non-linear operations within the NoC of the accelerator and can be overlaid onto existing neuro accelerators to map attention layers at the edge. Our results show that the NOVA architecture is up to $37.8\times$ more power-efficient than state-of-the-art hardware approximators when running existing attention-based neural networks.
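To make the slope-and-bias idea from the TL;DR concrete, the following is a minimal software sketch of piecewise-linear approximation applied to the exponential inside softmax. It is not the NOVA hardware design: the function names (`build_pwl_table`, `pwl_eval`, `softmax_pwl`), the 16 uniform segments, and the input range $[-8, 0]$ are all assumptions chosen for illustration. The single shared table stands in, very loosely, for values that would be broadcast over the NoC instead of being stored in a LUT at every PE.

```python
import bisect
import math


def build_pwl_table(fn, lo, hi, segments):
    """Fit fn on [lo, hi] with `segments` chords between uniform breakpoints.

    Returns (breakpoints, slopes, biases); all three are illustrative names.
    """
    step = (hi - lo) / segments
    breakpoints = [lo + i * step for i in range(segments + 1)]
    slopes, biases = [], []
    for x0, x1 in zip(breakpoints, breakpoints[1:]):
        slope = (fn(x1) - fn(x0)) / (x1 - x0)
        slopes.append(slope)
        biases.append(fn(x0) - slope * x0)  # chord passes through (x0, fn(x0))
    return breakpoints, slopes, biases


def pwl_eval(x, table):
    """Approximate fn(x) as slope * x + bias for the segment containing x."""
    breakpoints, slopes, biases = table
    x = min(max(x, breakpoints[0]), breakpoints[-1])  # clamp to fitted range
    # Index of the segment whose left breakpoint is <= x.
    i = min(bisect.bisect_right(breakpoints, x) - 1, len(slopes) - 1)
    return slopes[i] * x + biases[i]


def softmax_pwl(scores, table):
    """Softmax using the approximate exponential.

    Subtracting the maximum keeps every exponent input at or below zero.
    """
    m = max(scores)
    exps = [pwl_eval(s - m, table) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed configuration: exp on [-8, 0] with 16 segments of width 0.5.
EXP_TABLE = build_pwl_table(math.exp, -8.0, 0.0, 16)

if __name__ == "__main__":
    print(softmax_pwl([1.0, 2.0, 3.0], EXP_TABLE))
```

In this sketch each evaluation needs only one comparison against the breakpoints, one multiply, and one add, which is why a single table shared by every element is attractive. With chords of width 0.5, the absolute error of the approximate exponential on $[-8, 0]$ stays below $0.5^2/8 \approx 0.031$.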
