NVR: Vector Runahead on NPUs for Sparse Memory Access

Hui Wang, Zhengpeng Zhao, Jing Wang, Yushu Du, Yuan Cheng, Bing Guo, He Xiao, Chenhao Ma, Xiaomeng Han, Dean You, Jiapeng Guan, Ran Wei, Dawei Yang, Zhe Jiang

TL;DR

Sparse DNN workloads on NPUs suffer from irregular memory accesses that cause frequent cache misses and stall the accelerator. NVR adds a decoupled vector runahead prefetcher built from a stride detector, a sparse chain detector for indirect patterns, and a loop bound detector, plus a vectorisation micro-instruction generator and an optional Non-blocking Speculative Buffer (NSB), to predict and prefetch data ahead of NPU execution. The paper reports an average 90% reduction in cache misses relative to state-of-the-art general-purpose prefetchers, substantial off-chip bandwidth savings, and a 4x average speedup over NPUs without prefetching, at a hardware overhead under 5%. Pairing NVR with a small NSB delivers 5x the performance benefit of enlarging the L2 cache by the same amount. The method accelerates sparse DNN and LLM inference on NPUs without compiler or algorithm changes, and informs future architectural design for memory-bound AI workloads.

Abstract

Deep Neural Networks increasingly leverage sparsity to curb the growth in model parameter size. However, reducing wall-clock time through sparsity and pruning remains challenging because irregular memory access patterns lead to frequent cache misses. In this paper, we present NPU Vector Runahead (NVR), a prefetching mechanism tailored for NPUs to address cache miss problems in sparse DNN workloads. Rather than optimising memory patterns with high overhead and poor portability, NVR adapts runahead execution to the unique architecture of NPUs. NVR provides a general micro-architectural solution for sparse DNN workloads without requiring compiler or algorithmic support, operating as a decoupled, speculative, lightweight hardware sub-thread alongside the NPU, with minimal hardware overhead (under 5%). NVR achieves an average 90% reduction in cache misses compared to state-of-the-art (SOTA) prefetching in general-purpose processors, delivering a 4x average speedup on sparse workloads versus NPUs without prefetching. Moreover, we investigate the advantages of incorporating a small cache (16KB) into the NPU combined with NVR. Our evaluation shows that expanding this modest cache delivers 5x higher performance benefits than increasing the L2 cache size by the same amount.
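
To make the idea of prefetching ahead of an indirect access stream concrete, the following is a minimal toy sketch, not the NVR hardware design. It models a gather of the form `W[indices[i]]` against a small LRU cache and compares demand misses with and without a helper that reads the index stream a few elements ahead. The cache geometry, the `lookahead` parameter, the function names, and the assumption that prefetches complete instantly are all illustrative choices that do not come from the paper.

```python
from collections import OrderedDict
import random


class LRUCache:
    """Tiny fully associative LRU cache that tracks line addresses only."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()

    def contains(self, line):
        return line in self.lines

    def fill(self, line):
        # Insert (or refresh) a line and evict the least recently used one.
        self.lines[line] = True
        self.lines.move_to_end(line)
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)


def count_demand_misses(indices, line_size=8, num_lines=32, lookahead=0):
    """Count demand misses for the gather W[indices[i]].

    lookahead=0 models no prefetching. lookahead=k models an idealised
    helper (an assumption of this sketch) that reads indices[i+1..i+k]
    ahead of the demand stream and fills those lines with zero latency.
    """
    cache = LRUCache(num_lines)
    misses = 0
    for i, idx in enumerate(indices):
        line = idx // line_size
        if not cache.contains(line):
            misses += 1
        cache.fill(line)
        # Run ahead: resolve future indices and prefetch their lines.
        for j in range(i + 1, min(i + 1 + lookahead, len(indices))):
            cache.fill(indices[j] // line_size)
    return misses


random.seed(0)
indices = [random.randrange(4096) for _ in range(1000)]
baseline = count_demand_misses(indices, lookahead=0)
prefetched = count_demand_misses(indices, lookahead=4)
print("demand misses without prefetch:", baseline)
print("demand misses with lookahead 4:", prefetched)
```

Because the sketch ignores memory latency and bandwidth, it only illustrates why knowing future indices removes demand misses; it says nothing about the speedups the paper measures.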

Paper Structure

This paper contains 22 sections, 1 equation, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Sparsity is a widely adopted approach to speedup and energy efficiency that skips zeros during processing, which introduces substantial irregular memory accesses.
  • Figure 2: Sparse Matrix Multiplication can be categorised into one-side-sparsity and two-sides-sparsity patterns, with higher sparsity offering greater speedup potential at the cost of more challenging access patterns. Here, $\texttt{spatial\_for}$ denotes parallel operation on the NPU, while $\texttt{IA}$ (input activation), $\texttt{W}$ (weight), and $\texttt{OA}$ (output activation) represent the input variables, weight parameters, and output results, respectively.
  • Figure 3: NVR micro-architecture and components. Purple blocks represent NVR additions to the system. Red blocks indicate shared components between NVR and NPU, assisting speculative execution during NPU sparse unit idle periods. (b) SD: Stride Detector; (c) LBD: Loop Bound Detector; (d) SCD: Sparse Chain Detector; (e) VMIG: Vectorisation Micro-Instruction Generator; (f) NSB: Non-blocking Speculative Buffer.
  • Figure 4: Vectorisation micro-instruction generation pipeline. Micro-instruction 1-1 represents the first micro-instruction of instruction 1. Each micro-instruction loads a variable number of data elements.
  • Figure 5: Normalised wall-clock latency for each sparse workload. Within each group, the bars from left to right denote execution in density, in-order execution, out-of-order (OoO) execution, IMP, DVR, and NVR, respectively. The lower segment indicates the NPU base execution time, whilst the upper segment represents the stall time caused by cache misses.
  • ...and 4 more figures
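
The one-side-sparsity pattern described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's kernel: it assumes the weight matrix `W` is the sparse operand, stores it in CSR form (a format chosen here for illustration), and keeps `IA` dense. The helper names `to_csr` and `spmm_one_side` are hypothetical. The point is that `values` and `col_idx` are read with unit stride, while the row of `IA` touched in the inner loop depends on the loaded index, which is the irregular access a prefetcher has to predict.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for k, w in enumerate(row):
            if w != 0:
                values.append(w)
                col_idx.append(k)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmm_one_side(values, col_idx, row_ptr, IA):
    """Compute OA = W @ IA with W in CSR form and IA a dense K x N matrix."""
    n_cols = len(IA[0])
    OA = []
    # The outer loop is the one an NPU could run as a spatial_for.
    for m in range(len(row_ptr) - 1):
        acc = [0] * n_cols
        # Unit-stride walk over the non-zeros of row m.
        for p in range(row_ptr[m], row_ptr[m + 1]):
            k = col_idx[p]   # index load (regular, stride-1)
            w = values[p]    # weight load (regular, stride-1)
            ia_row = IA[k]   # indirect access: address depends on k
            for n in range(n_cols):
                acc[n] += w * ia_row[n]
        OA.append(acc)
    return OA


W = [[0, 2, 0],
     [1, 0, 3]]
IA = [[1, 2],
      [3, 4],
      [5, 6]]
values, col_idx, row_ptr = to_csr(W)
print(spmm_one_side(values, col_idx, row_ptr, IA))  # [[6, 8], [16, 20]]
```

In the two-sides-sparsity case both operands would carry index arrays, so the inner loop would also need an index-matching step, which makes the access pattern harder still to predict.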