Table of Contents
Fetching ...

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu

Abstract

On-device running Large Language Models (LLMs) is nowadays a critical enabler towards preserving user privacy. We observe that the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of quantization sensitivity in state-of-the-art frameworks. This fallback results in a degraded user experience and increased complexity in system scheduling. To this end, this paper presents shadowAttn, a system-algorithm codesigned sparse attention module with minimal reliance on CPU/GPU by only sparsely calculating the attention on a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with a NPU-based pilot compute. Further, shadowAttn proposes insightful techniques such as NPU compute graph bucketing, head-wise NPU-CPU/GPU pipeline and per-head fine-grained sparsity ratio to achieve high accuracy and efficiency. shadowAttn delivers the best performance with highly limited CPU/GPU resource; it requires much less CPU/GPU resource to deliver on-par performance of SoTA frameworks.

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

Abstract

On-device running Large Language Models (LLMs) is nowadays a critical enabler towards preserving user privacy. We observe that the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of quantization sensitivity in state-of-the-art frameworks. This fallback results in a degraded user experience and increased complexity in system scheduling. To this end, this paper presents shadowAttn, a system-algorithm codesigned sparse attention module with minimal reliance on CPU/GPU by only sparsely calculating the attention on a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with a NPU-based pilot compute. Further, shadowAttn proposes insightful techniques such as NPU compute graph bucketing, head-wise NPU-CPU/GPU pipeline and per-head fine-grained sparsity ratio to achieve high accuracy and efficiency. shadowAttn delivers the best performance with highly limited CPU/GPU resource; it requires much less CPU/GPU resource to deliver on-par performance of SoTA frameworks.

Paper Structure

This paper contains 15 sections, 4 equations, 20 figures, 9 tables, 1 algorithm.

Figures (20)

  • Figure 1: The workflow of static compute graph of mobile NPUs. The latency is acquired on QNN SDK qnn by a basic matrix multiplication operation.
  • Figure 2: The attention score skewness of LLMs. Profiled on 128 samples from WikiText-2.
  • Figure 3: The workflow of sparse attention.
  • Figure 4: Latency breakdown.
  • Figure 5: Accuracy drop.
  • ...and 15 more figures