PowerInfer-2: Fast Large Language Model Inference on a Smartphone

Zhenliang Xue, Yixin Song, Zeyu Mi, Xinrui Zheng, Yubin Xia, Haibo Chen

TL;DR

PowerInfer-2 tackles the challenge of running large language models on memory-constrained smartphones by introducing a neuron-cluster abstraction and two core principles: Sparsity-Aware Adaptation and I/O-Aware Orchestration. It combines offline planning with an adaptive online engine that splits work between the NPU and CPU and pipelines computation with I/O, aided by an in-memory neuron cache. The system achieves speedups of up to 27.8x over llama.cpp and runs 47B models on mobile devices with minimal accuracy loss, while also reducing memory usage and energy per token. This mobile-centric approach brings on-device LLMs closer to practical use, enabling private, offline, real-time AI assistance on widely available hardware.

Abstract

Large language models (LLMs) on smartphones enable real-time AI assistance and privacy-preserving, offline operation. However, the resource constraints of smartphones limit current deployments to small language models (SLMs), significantly compromising their capabilities. This paper introduces PowerInfer-2, a smartphone-based framework that enables fast inference for LLMs whose size exceeds the device's memory capacity. The key insight is to decompose matrix operations into neuron clusters as the basic processing unit, which enables flexible scheduling and efficient I/O-computation pipelining. PowerInfer-2 applies this neuron-cluster-based design to both computation and storage. For computation, neuron clusters with dense activations are processed on the NPU, while sparse clusters run on the CPU. The storage engine provides a fine-grained pipeline mechanism that coordinates cluster-level computation and I/O operations, enhanced by a segmented neuron cache that reduces I/O activity. PowerInfer-2 achieves up to a 27.8x speedup over state-of-the-art frameworks. It is the first system to serve a 47B LLM on a smartphone, achieving 11.68 tokens/s. Notably, these performance improvements preserve model quality with negligible accuracy degradation.
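To make the neuron-cluster idea concrete, here is a minimal sketch in pure Python, not the paper's implementation. It treats each FFN neuron as one row of an up-projection plus its matching output weights, groups the most frequently activated neurons into a "hot" cluster that is always computed densely (standing in for the NPU path), and computes "cold" neurons only when a predictor says they will fire (standing in for the CPU path). All names (`split_clusters`, `hybrid_ffn`, `oracle`), the ReLU activation, the toy sizes, and the 25% hot fraction are illustrative assumptions. The oracle predictor simply checks the true pre-activation, so it saves no work here; a real system would use a cheap learned predictor, and nothing below models NPU kernels, flash I/O, or the neuron cache.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def accumulate(out, scale, w_out):
    # Add one neuron's contribution (activation * output weights) to the result.
    for j, w in enumerate(w_out):
        out[j] += scale * w

def dense_ffn(up, down, x):
    # Reference path: every neuron is computed, active or not.
    out = [0.0] * len(down[0])
    for w_in, w_out in zip(up, down):
        accumulate(out, max(0.0, dot(w_in, x)), w_out)  # ReLU activation
    return out

def split_clusters(freq, hot_fraction):
    # Offline-planning stand-in: the most frequently activated neurons are "hot".
    order = sorted(range(len(freq)), key=lambda i: freq[i], reverse=True)
    n_hot = int(len(order) * hot_fraction)
    return order[:n_hot], order[n_hot:]

def hybrid_ffn(up, down, x, hot, cold, predict_active):
    out = [0.0] * len(down[0])
    computed = 0
    # Hot cluster: dense computation with no predictor (NPU stand-in).
    for i in hot:
        accumulate(out, max(0.0, dot(up[i], x)), down[i])
        computed += 1
    # Cold cluster: skip neurons predicted to be inactive (CPU stand-in).
    for i in cold:
        if predict_active(i, x):
            accumulate(out, max(0.0, dot(up[i], x)), down[i])
            computed += 1
    return out, computed

# Hypothetical toy model: neurons with a higher weight mean fire more often
# on the non-negative inputs generated below.
rng = random.Random(0)
n_neurons, d_model = 64, 16
means = [rng.uniform(-1.0, 1.0) for _ in range(n_neurons)]
up = [[rng.gauss(m, 1.0) for _ in range(d_model)] for m in means]
down = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(n_neurons)]

def sample_input():
    return [abs(rng.gauss(0.0, 1.0)) for _ in range(d_model)]

# Profile activation frequency on sample inputs, then split into clusters.
profile = [sample_input() for _ in range(200)]
freq = [sum(dot(w, s) > 0.0 for s in profile) / len(profile) for w in up]
hot, cold = split_clusters(freq, hot_fraction=0.25)

# Assumption: an oracle predictor that inspects the true pre-activation.
oracle = lambda i, x: dot(up[i], x) > 0.0

x = sample_input()
ref = dense_ffn(up, down, x)
out, computed = hybrid_ffn(up, down, x, hot, cold, oracle)
max_err = max(abs(a - b) for a, b in zip(ref, out))
print(f"computed {computed}/{n_neurons} neurons, max abs error {max_err:.2e}")
```

Because inactive ReLU neurons contribute exactly zero, skipping them leaves the output unchanged up to floating-point summation order. With an imperfect predictor, mispredicted cold neurons would introduce a small error, which is the accuracy trade-off the abstract refers to.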

Paper Structure (36 sections, 14 figures, 8 tables)

Figures (14)

  • Figure 1: Comparison of sampling strategies used in LLM inference. (a) Basic sampling generates a single response directly from the prompt. (b) Best-of-N sampling generates multiple candidate response sequences for one prompt, and selects the final response through a ranking process.
  • Figure 2: Neuron activation patterns in layer 10 of Bamboo-7B under different batch sizes. The X-axis represents the proportion of neurons (sorted by activation frequency), and the Y-axis shows different batch sizes. Darker red indicates lower activation frequency.
  • Figure 3: (a) Comparison of execution times for matrix-vector multiplication operations with varying computational loads across CPU, GPU, and NPU platforms. The test involves multiplying a 14336×4096 matrix with vectors under different batch sizes. The X-axis represents the batch size while the Y-axis shows execution time. (b) Random read throughput performance for 4KB operations across different block sizes and data ranges. The X-axis depicts block size, while the Y-axis represents throughput.
  • Figure 4: The architecture overview of PowerInfer-2.
  • Figure 5: Two computing workflows for prefill and decoding phases. (a) The prefill phase uses an NPU-centric workflow that leverages NPU for computation; (b) The decoding phase employs a CPU-NPU hybrid workflow for FFN computation where NPU handles dense computations for hot neurons while CPU cores process sparse computations for cold neurons, with their processing ratio automatically adjusting to match the dynamic sparsity patterns caused by varying batch sizes. The attention computation is handled entirely by NPU but not shown in the figure.
  • ...and 9 more figures