Fast On-device LLM Inference with NPUs

Daliang Xu; Hao Zhang; Liming Yang; Ruiqi Liu; Gang Huang; Mengwei Xu; Xuanzhe Liu

TL;DR

This work addresses the latency challenge of on-device LLM inference by introducing llm.npu, the first system to offload prefill workloads to mobile NPUs, reconstructing the prompt and model at three levels. It presents three techniques (chunk-sharing graph execution, shadow outlier execution, and out-of-order subgraph execution) to maximize NPU utilization while keeping accuracy close to FP16. Across multiple models and tasks, the reported results are 22.4x faster prefill and 30.7x energy savings on average compared to competitive baselines, and up to 32.8x speedup in an end-to-end real-world application. The work demonstrates the practical viability of NPU-centric co-design for on-device LLMs and outlines hardware-aware future directions for mobile AI accelerators.

Abstract

On-device inference for Large Language Models (LLMs), driven by increasing privacy concerns and advances in mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) suffer from unacceptably high inference latency, often bottlenecked by the prefill stage in tasks like screen UI understanding. We present llm.npu, the first LLM inference system that uses on-device Neural Processing Unit (NPU) offloading to reduce prefill latency. llm.npu enhances NPU offloading efficiency by reconstructing the prompt and model at three levels: (1) at the prompt level, it divides variable-length prompts into multiple fixed-sized chunks while maintaining data dependencies; (2) at the tensor level, it identifies and extracts significant outliers to run on the CPU/GPU in parallel with minimal overhead; (3) at the block level, it schedules Transformer blocks in an out-of-order manner to the CPU/GPU and NPU based on their hardware affinity and sensitivity to accuracy. Compared to competitive baselines, llm.npu achieves 22.4x faster prefill speed and 30.7x energy savings on average, and up to 32.8x speedup in an end-to-end real-world application. For the first time, llm.npu achieves more than 1,000 tokens/sec prefilling for a billion-sized model.
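The prompt-level and tensor-level ideas above can be illustrated with a minimal Python sketch. It is not the llm.npu implementation: the function names, the `PAD_ID` value, and the fixed clipping `threshold` are all assumptions made for illustration. The sketch only shows the data reshaping (fixed-size chunks with a padded tail, and a split of values into an in-range part plus a sparse outlier residual); it does not model attention across chunks, quantization, or the CPU/GPU/NPU scheduling.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id, not taken from the paper


def split_into_chunks(token_ids: List[int], chunk_size: int) -> List[Tuple[List[int], int]]:
    """Split a variable-length prompt into fixed-size chunks.

    Returns (chunk, valid_len) pairs. Every chunk has exactly chunk_size
    entries; the last one is padded, and valid_len says how many are real.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for start in range(0, len(token_ids), chunk_size):
        piece = token_ids[start:start + chunk_size]
        valid_len = len(piece)
        piece = piece + [PAD_ID] * (chunk_size - valid_len)
        chunks.append((piece, valid_len))
    return chunks


def split_outliers(values: List[float], threshold: float) -> Tuple[List[float], List[float]]:
    """Decompose values into an in-range part and an outlier residual.

    clipped[i] lies in [-threshold, threshold]. residual[i] is non-zero only
    where values[i] falls outside that range, and clipped[i] + residual[i]
    reconstructs values[i] up to floating-point rounding.
    """
    clipped = [max(-threshold, min(threshold, v)) for v in values]
    residual = [v - c for v, c in zip(values, clipped)]
    return clipped, residual
```

Because a matrix multiplication is linear, the results computed separately on the clipped part and on the residual can be summed afterwards, which is what makes it possible in principle to run the two parts on different processors.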

Paper Structure (35 sections, 5 equations, 19 figures, 6 tables)

Introduction
Background
On-device LLM Inference Analysis
Opportunity: Mobile NPUs
Gaps between LLMs and Mobile NPUs
llm.npu Design
Overview of llm.npu
Chunk-sharing graph execution
Shadow outlier execution
Out-of-order subgraph execution
Implementation and Evaluation
Experiment setups
Prefill performance
End-to-end performance
Inference accuracy
...and 20 more sections

Figures (19)

Figure 1: Breakdown of end-to-end inference latency for UI automation, context-aware QA, and chat summaries. CPU evaluation uses llama.cpp as the on-device inference engine, while GPU evaluation uses TFLite as the simulation engine.
Figure 2: The workflow of executing DNNs on mobile NPUs, with latencies for each procedure on QNN.
Figure 3: Per-tensor quantization MatMul and per-group quantization MatMul. $seq$, $hds$, $group$ represent sequence length, hidden size, and group number, respectively.
Figure 4: The prefill latency and accuracy on the HellaSwag dataset for different quantization algorithms on a Xiaomi 14 using the Qualcomm QNN framework.
Figure 5: The workflow of quantized on-device LLM inference. Operations shown in blue are computed using INT4/INT8 formats, while those in orange are computed using float data formats.
...and 14 more figures
