
Fast On-device LLM Inference with NPUs

Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu

TL;DR

This work addresses the latency of on-device LLM inference by introducing llm.npu, the first system to offload prefill workloads to mobile NPUs, using three-level prompt and model reconstruction. It presents three techniques (chunk-sharing graphs, shadow outlier execution, and out-of-order subgraph execution) that maximize NPU utilization while keeping accuracy close to FP16. Compared with competitive baselines, it reports 22.4x faster prefill and 30.7x energy savings on average, and up to 32.8x end-to-end speedup in a real-world application, across multiple models and tasks. The work demonstrates the practical viability of NPU-centric co-design for on-device LLMs and outlines hardware-aware future directions for mobile AI accelerators.

Abstract

On-device inference for Large Language Models (LLMs), driven by increasing privacy concerns and advances in mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) suffer from unacceptably high inference latency, often bottlenecked by the prefill stage in tasks like screen UI understanding. We present llm.npu, the first LLM inference system that uses on-device Neural Processing Unit (NPU) offloading to reduce prefill latency. llm.npu improves NPU offloading efficiency by reconstructing the prompt and model at three levels: (1) at the prompt level, it divides variable-length prompts into multiple fixed-size chunks while maintaining data dependencies; (2) at the tensor level, it identifies and extracts significant outliers to run on the CPU/GPU in parallel with minimal overhead; (3) at the block level, it schedules Transformer blocks out of order onto the CPU/GPU and NPU based on their hardware affinity and sensitivity to accuracy. Compared to competitive baselines, llm.npu achieves 22.4x faster prefill speed and 30.7x energy savings on average, and up to 32.8x speedup in an end-to-end real-world application. For the first time, llm.npu achieves more than 1,000 tokens/sec prefilling for a billion-sized model.
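
The first two levels described above can be illustrated with a small, hedged sketch. It is not the authors' implementation: the names `split_prompt`, `split_outliers`, `shadow_dot`, `PAD_ID`, and the `clip` threshold are hypothetical, weights are kept in float for brevity, and the "NPU" and "CPU/GPU" paths are ordinary Python arithmetic standing in for real devices. It assumes causal attention, so each chunk depends on all earlier chunks, and it assumes the outlier path simply computes in float whatever the clamped INT8 path lost.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

PAD_ID = 0  # hypothetical padding token id


@dataclass
class Chunk:
    tokens: List[int]  # exactly chunk_len ids; the tail may be padding
    valid: int         # number of real (non-padding) tokens
    kv_visible: int    # prompt positions this chunk may attend to (causal)


def split_prompt(token_ids: List[int], chunk_len: int) -> List[Chunk]:
    """Prompt level: cut a variable-length prompt into fixed-size chunks."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    chunks = []
    for start in range(0, len(token_ids), chunk_len):
        piece = token_ids[start:start + chunk_len]
        padded = piece + [PAD_ID] * (chunk_len - len(piece))
        # A chunk depends on every earlier chunk plus its own real tokens.
        chunks.append(Chunk(padded, len(piece), start + len(piece)))
    return chunks


def split_outliers(x: List[float], clip: float) -> Tuple[List[float], Dict[int, float]]:
    """Tensor level: clamp to [-clip, clip]; keep the excess as a sparse residual."""
    inliers = [max(-clip, min(clip, v)) for v in x]
    residual = {i: v - c for i, (v, c) in enumerate(zip(x, inliers)) if v != c}
    return inliers, residual


def shadow_dot(x: List[float], w: List[float], clip: float) -> float:
    """Dot product with an INT8 main path and a float path for outliers."""
    inliers, residual = split_outliers(x, clip)
    scale = clip / 127.0                                  # one scale per tensor
    q = [round(v / scale) for v in inliers]               # integers in [-127, 127]
    main = scale * sum(qi * wi for qi, wi in zip(q, w))   # stands in for the NPU
    shadow = sum(r * w[i] for i, r in residual.items())   # stands in for the CPU/GPU
    return main + shadow
```

With a chunk length of 4, a 10-token prompt yields three chunks, the last one padded with two `PAD_ID` entries, so every chunk has the same shape. In `shadow_dot`, only the entries that exceed `clip` reach the float path, which is why that path can stay small when outliers are rare.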

Paper Structure

This paper contains 35 sections, 5 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: Breakdown of end-to-end inference latency for UI automation, context-aware QA, and chat summaries. CPU evaluation uses llama.cpp as the on-device inference engine, while GPU evaluation uses TFLite as the simulation engine.
  • Figure 2: The workflow of executing DNNs on mobile NPUs, with latencies for each procedure on QNN.
  • Figure 3: Per-tensor quantization MatMul and per-group quantization MatMul. $seq$, $hds$, $group$ represent sequence length, hidden size, and group number, respectively.
  • Figure 4: The prefill latency and accuracy of different quantization algorithms on the HellaSwag dataset, measured on a Xiaomi 14 using the Qualcomm QNN framework.
  • Figure 5: The workflow of quantized on-device LLM inference. Operations shown in blue are computed using INT4/INT8 formats, while those in orange are computed using float data formats.
  • ...and 14 more figures