Neuralink: Fast LLM Inference on Smartphones with Neuron Co-Activation Linking

Tuowei Wang, Ruwen Fan, Minxing Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren

TL;DR

The paper tackles the challenge of running large language models on smartphones by addressing the I/O bottleneck caused by activation sparsity. It introduces Neuralink, a two-stage algorithm–system co-design that places frequently co-activated neurons contiguously in flash memory (offline correlation-aware clustering) and then refines online to maximize read continuity (IOPS-friendly access collapse and linking-aligned caching). By reformulating neuron placement as a Hamiltonian-path problem and applying a greedy heuristic, Neuralink substantially boosts on-device I/O bandwidth and reduces end-to-end latency, achieving average improvements of $1.80\times$ in bandwidth and $1.49\times$ in latency over state-of-the-art baselines across multiple devices and models. The work demonstrates the first storage-placement optimization under activation sparsity, offering a practical path toward more efficient on-device LLM inference through cross-layer algorithm–system co-design.
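To make the placement idea concrete, here is a minimal sketch of one possible greedy heuristic for the Hamiltonian-path view: treat neurons as nodes, co-activation counts as edge weights, and repeatedly link the strongest pair whose members are still chain endpoints. The function names (`coactivation_counts`, `greedy_placement`), the trace format, and the tie-breaking rule are assumptions made for illustration; the paper's own algorithm may differ in its details.

```python
from collections import Counter
from itertools import combinations


def coactivation_counts(traces):
    """Count how often each unordered neuron pair is active in the same trace."""
    counts = Counter()
    for active in traces:
        for i, j in combinations(sorted(set(active)), 2):
            counts[(i, j)] += 1
    return counts


def greedy_placement(num_neurons, counts):
    """Greedy edge merging (illustrative, not the paper's exact procedure)."""
    # Every neuron starts as its own one-element chain.
    chain_of = {n: [n] for n in range(num_neurons)}
    # Visit the strongest pairs first; ties are broken by pair index.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for (i, j), _ in ranked:
        a, b = chain_of[i], chain_of[j]
        if a is b:
            continue  # same chain already; a link would close a cycle
        if i not in (a[0], a[-1]) or j not in (b[0], b[-1]):
            continue  # only chain endpoints can accept a new neighbour
        if a[-1] != i:
            a.reverse()  # put i at the tail of its chain
        if b[0] != j:
            b.reverse()  # put j at the head of its chain
        a.extend(b)
        for n in b:
            chain_of[n] = a
    # Concatenate whatever chains remain into a single flash order.
    order, seen = [], set()
    for n in range(num_neurons):
        chain = chain_of[n]
        if id(chain) not in seen:
            seen.add(id(chain))
            order.extend(chain)
    return order


# Hypothetical activation traces: each inner list holds the neurons active for one token.
traces = [[0, 3], [0, 3], [0, 3], [1, 2], [1, 2], [3, 1]]
order = greedy_placement(4, coactivation_counts(traces))
```

In this toy run the pairs (0, 3) and (1, 2) end up adjacent in `order`, because they have the highest co-activation counts. The greedy pass gives no optimality guarantee; it is only a cheap approximation to the underlying path problem.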

Abstract

Large Language Models (LLMs) have achieved remarkable success across various domains, yet deploying them on mobile devices remains an arduous challenge due to their extensive computational and memory demands. While lightweight LLMs have been developed to fit mobile environments, they suffer from degraded model accuracy. In contrast, sparsity-based techniques minimize DRAM usage by selectively transferring only relevant neurons to DRAM while retaining the full model in external storage, such as flash. However, such approaches are critically limited by numerous I/O operations, particularly on smartphones with severe IOPS constraints. In this paper, we propose Neuralink, a novel approach that accelerates LLM inference on smartphones by optimizing neuron placement in flash memory. Neuralink leverages the concept of Neuron Co-Activation, where neurons frequently activated together are linked to facilitate continuous read access and optimize I/O efficiency. Our approach incorporates a two-stage solution: an offline stage that reorganizes neuron placement based on co-activation patterns, and an online stage that employs tailored data access and caching strategies to align well with hardware characteristics. Evaluations conducted on a variety of smartphones and LLMs demonstrate that Neuralink achieves on average $1.49\times$ improvements in end-to-end latency compared to the state-of-the-art. As the first solution to optimize storage placement under sparsity, Neuralink explores a new optimization space at the intersection of sparsity-driven algorithm and storage-level system co-design for LLM inference.
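The link between placement and I/O cost can be illustrated with a small sketch. It assumes, as a simplification, that each contiguous run of activated neurons in flash can be fetched with a single read request; real devices add page-size and alignment effects that this ignores. The helper name `count_read_ops` and the example layouts are hypothetical.

```python
def count_read_ops(placement, active):
    """Count contiguous runs of activated neurons in a given flash layout."""
    # Map each neuron id to its slot in the layout.
    position = {neuron: slot for slot, neuron in enumerate(placement)}
    slots = sorted(position[n] for n in set(active))
    # A new read starts whenever a slot does not directly follow the previous one.
    return sum(1 for k, s in enumerate(slots) if k == 0 or s != slots[k - 1] + 1)


active = [0, 3, 5]                    # neurons activated for one token (made up)
original_layout = [0, 1, 2, 3, 4, 5]  # neurons stored in model order
linked_layout = [0, 3, 5, 1, 2, 4]    # co-activated neurons stored side by side

ops_original = count_read_ops(original_layout, active)
ops_linked = count_read_ops(linked_layout, active)
```

Under these assumptions the same three neurons need three separate reads in the original layout but only one in the linked layout, which is the effect that matters on IOPS-constrained storage.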

Paper Structure

This paper contains 23 sections, 4 equations, 17 figures, 8 tables, 1 algorithm.

Figures (17)

  • Figure 1: The bandwidth and IOPS during inference across various LLMs on OnePlus Ace2. Neuralink shifts the I/O bottleneck from IOPS (lower right) to bandwidth (upper left).
  • Figure 2: Activation sparsity introduced by ReLU. Each element in the intermediate activations $A$ with a zero value (colored in green) deactivates two neurons (uncolored): the corresponding row in up-projection matrix $U$ and the column in down-projection matrix $D$ within the FFN block.
  • Figure 3: A three-step procedure for LLM inference on smartphones leveraging activation sparsity: ❶ Identify the activated neurons for a given input using predictors [deja-vu, powerinfer] or sparsity-aware metrics [q-sparse, slm-activation-sparsity]. ❷ With the full parameters stored in flash memory, load only the activated neurons into DRAM. ❸ Perform inference using the activated neurons.
  • Figure 4: UFS Bandwidth at varying continuous I/O sizes on smartphones. The near-linear relationship indicates that the bottleneck lies in IOPS, rather than the bandwidth capacity.
  • Figure 5: Visualization of neuron co-activation patterns across different LLMs (vertical) and datasets (horizontal). Each matrix is the adjacency matrix of the neurons within one layer of a given LLM, where the element at position $(i, j)$ indicates the co-activation frequency between neuron $i$ and neuron $j$. Brighter colors denote higher values.
  • ...and 12 more figures
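
The activation sparsity described in the Figure 2 caption can be sketched in a few lines. The example below assumes a plain ReLU FFN of the form $y = D \cdot \mathrm{ReLU}(U x)$ with tiny made-up matrices, and it picks the active set from a dense pass rather than from a predictor; the function names are illustrative only.

```python
def ffn_dense(U, D, x):
    """Full FFN: A = ReLU(U x), y = D A."""
    A = [max(0.0, sum(u * xi for u, xi in zip(row, x))) for row in U]
    y = [sum(d * a for d, a in zip(row, A)) for row in D]
    return y, A


def ffn_sparse(U, D, x, active):
    """Use only row n of U and column n of D for each neuron n in `active`."""
    y = [0.0] * len(D)
    for n in active:
        a = max(0.0, sum(u * xi for u, xi in zip(U[n], x)))
        for r in range(len(D)):
            y[r] += D[r][n] * a
    return y


# Toy weights (assumed values): 4 neurons, 2-dimensional input and output.
U = [[1, 0], [0, 1], [1, 1], [-1, 2]]
D = [[1, 2, 3, 4], [5, 6, 7, 8]]
x = [1.0, -1.0]

y_dense, A = ffn_dense(U, D, x)
active = [n for n, a in enumerate(A) if a > 0]  # oracle choice, for illustration
y_sparse = ffn_sparse(U, D, x, active)
```

Here only one of the four neurons has a non-zero activation, so the sparse pass touches a single row of `U` and a single column of `D` yet produces the same output as the dense pass.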