NVR: Vector Runahead on NPUs for Sparse Memory Access

Hui Wang; Zhengpeng Zhao; Jing Wang; Yushu Du; Yuan Cheng; Bing Guo; He Xiao; Chenhao Ma; Xiaomeng Han; Dean You; Jiapeng Guan; Ran Wei; Dawei Yang; Zhe Jiang

TL;DR

Sparse DNN workloads on NPUs suffer from irregular memory accesses that cause frequent cache misses and stall the accelerator. NVR adds a decoupled vector runahead prefetcher with modules for stride, indirect-pattern, and loop-bound reasoning, plus a micro-instruction generator and an optional NSB (a small cache inside the NPU), to predict and prefetch data ahead of NPU execution. NVR reduces cache misses by 90% on average relative to SOTA prefetchers for general-purpose processors, delivers a 4x average speedup over NPUs without prefetching, and saves substantial off-chip bandwidth, all with under 5% hardware overhead. Expanding a 16KB NSB yields 5x higher performance benefit than growing the L2 cache by the same amount. The result is a practical, workload-driven way to accelerate sparse DNN and LLM inference on NPUs without compiler or algorithm changes, and it can inform future architectural design for memory-bound AI workloads.

Abstract

Deep neural networks increasingly leverage sparsity to curb the growth in model parameter size. However, turning sparsity and pruning into lower wall-clock time remains challenging because irregular memory access patterns lead to frequent cache misses. In this paper, we present NPU Vector Runahead (NVR), a prefetching mechanism tailored for NPUs to address cache misses in sparse DNN workloads. Rather than optimising memory patterns, which incurs high overhead and ports poorly, NVR adapts runahead execution to the unique architecture of NPUs. NVR provides a general micro-architectural solution for sparse DNN workloads without requiring compiler or algorithmic support. It operates as a decoupled, speculative, lightweight hardware sub-thread alongside the NPU, with minimal hardware overhead (under 5%). NVR achieves an average 90% reduction in cache misses compared to SOTA prefetching in general-purpose processors and delivers a 4x average speedup on sparse workloads versus NPUs without prefetching. Moreover, we investigate the advantages of incorporating a small cache (16KB) into the NPU combined with NVR. Our evaluation shows that expanding this modest cache delivers 5x higher performance benefits than increasing the L2 cache size by the same amount.
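
To make the idea of running ahead over an index stream concrete, the following is a minimal toy model, not the NVR hardware or the authors' evaluation setup. It simulates a gather loop of the form x[col_idx[i]] against a small LRU cache and compares demand misses with and without an idealised prefetcher that reads the index array a few iterations early. All names (LRUCache, count_demand_misses, line_elems, cache_lines, lookahead) are hypothetical, and the model ignores memory latency, prefetch timeliness, and bandwidth, so it is not meant to reproduce the numbers reported above.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative LRU cache keyed by cache-line number (toy model)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()

    def touch(self, line):
        """Access a line; return True on a hit. A miss inserts the line."""
        hit = line in self.lines
        if hit:
            self.lines.move_to_end(line)  # mark as most recently used
        else:
            self.lines[line] = True
            if len(self.lines) > self.num_lines:
                self.lines.popitem(last=False)  # evict least recently used
        return hit


def count_demand_misses(col_idx, line_elems=8, cache_lines=64, lookahead=0):
    """Count demand misses for the gather x[col_idx[i]].

    lookahead == 0 models no prefetching. lookahead == k models an idealised
    runahead engine (an assumption of this sketch) that reads col_idx[i + k]
    early and fetches the line holding x[col_idx[i + k]] before the demand
    access reaches it.
    """
    cache = LRUCache(cache_lines)
    misses = 0
    # Warm-up: the runahead engine starts `lookahead` entries ahead.
    for j in range(min(lookahead, len(col_idx))):
        cache.touch(col_idx[j] // line_elems)
    for i, col in enumerate(col_idx):
        if lookahead and i + lookahead < len(col_idx):
            cache.touch(col_idx[i + lookahead] // line_elems)  # prefetch
        if not cache.touch(col // line_elems):  # demand access
            misses += 1
    return misses


# Irregular (random) column indices, as in an unstructured sparse operand.
rng = random.Random(0)
col_idx = [rng.randrange(1_000_000) for _ in range(10_000)]
baseline = count_demand_misses(col_idx, lookahead=0)
runahead = count_demand_misses(col_idx, lookahead=8)
print(f"demand misses without prefetch: {baseline}, with runahead: {runahead}")
```

In this sketch the prefetcher never mispredicts because it reads the true index values, which is the property that distinguishes runahead-style prefetching from stride-only prediction on indirect accesses. A real design must also issue prefetches early enough to hide memory latency, which this model does not capture.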
