Fast On-device LLM Inference with NPUs
Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu
TL;DR
This work addresses the latency of on-device LLM inference by introducing llm.npu, the first system to offload prefill workloads to mobile NPUs, reconstructing the prompt and model at three levels. It presents three techniques (chunk-sharing graphs, shadow outlier execution, and out-of-order subgraph execution) that maximize NPU utilization while keeping accuracy close to FP16. Across multiple models and tasks, empirical results show 22.4x faster prefill and 30.7x energy savings on average, and up to 32.8x speedup in an end-to-end application. The work demonstrates the practical viability of NPU-centric co-design for on-device LLMs and outlines hardware-aware future directions for mobile AI accelerators.
Abstract
On-device inference for Large Language Models (LLMs), driven by growing privacy concerns and advances in mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) suffer from unacceptably high inference latency, often bottlenecked by the prefill stage in tasks like screen UI understanding. We present llm.npu, the first LLM inference system that uses on-device Neural Processing Unit (NPU) offloading to reduce prefill latency. llm.npu improves NPU offloading efficiency by reconstructing the prompt and model at three levels: (1) at the prompt level, it divides variable-length prompts into multiple fixed-sized chunks while maintaining data dependencies; (2) at the tensor level, it identifies and extracts significant outliers to run on the CPU/GPU in parallel with minimal overhead; (3) at the block level, it schedules Transformer blocks out of order onto the CPU/GPU and NPU based on their hardware affinity and sensitivity to accuracy. Compared to competitive baselines, llm.npu achieves 22.4x faster prefill and 30.7x energy savings on average, and up to 32.8x speedup in an end-to-end real-world application. For the first time, llm.npu achieves more than 1,000 tokens/sec prefilling for a billion-sized model.
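The prompt-level and tensor-level ideas can be illustrated with a minimal Python sketch. It is not the llm.npu implementation. The names `PAD_ID`, `split_into_chunks`, `visible_chunks`, and `split_outliers` are hypothetical, and the clip-plus-residual decomposition is only an assumption about how outliers might be separated from the in-range values.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id, chosen for illustration only


def split_into_chunks(token_ids: List[int], chunk_len: int) -> Tuple[List[List[int]], int]:
    """Split a variable-length prompt into fixed-size chunks.

    The last chunk is right-padded so every chunk has the same static shape.
    Returns the chunks and the count of real (non-padding) tokens.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    chunks = []
    for start in range(0, len(token_ids), chunk_len):
        chunk = token_ids[start:start + chunk_len]
        chunk = chunk + [PAD_ID] * (chunk_len - len(chunk))
        chunks.append(chunk)
    return chunks, len(token_ids)


def visible_chunks(chunk_index: int) -> List[int]:
    """Chunks that chunk `chunk_index` depends on under causal attention:
    itself and every earlier chunk."""
    return list(range(chunk_index + 1))


def split_outliers(activations: List[float], threshold: float) -> Tuple[List[float], List[float]]:
    """Separate activations into a clipped in-range part and an outlier residual.

    Assumption: the in-range part would go to the low-precision path and the
    residual (nonzero only for outliers) to a higher-precision path.
    """
    in_range = [max(-threshold, min(threshold, a)) for a in activations]
    residual = [a - c for a, c in zip(activations, in_range)]
    return in_range, residual


if __name__ == "__main__":
    chunks, real_tokens = split_into_chunks(list(range(1, 11)), chunk_len=4)
    print(chunks, real_tokens)
    print(visible_chunks(2))
    print(split_outliers([0.5, -9.0, 2.0, 12.0], threshold=4.0))
```

Fixed-size chunks let one statically shaped graph be reused for any prompt length, and the chunk dependency list records what each chunk must attend to. In the outlier split, the residual is zero wherever the activation was already within the threshold.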
