Neuralink: Fast LLM Inference on Smartphones with Neuron Co-Activation Linking
Tuowei Wang, Ruwen Fan, Minxing Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren
TL;DR
The paper targets LLM inference on smartphones, where sparsity-based methods keep the full model in flash and load only the activated neurons into DRAM. This saves memory but produces many small, scattered reads, so inference becomes bound by the limited IOPS of smartphone storage. Neuralink is a two-stage algorithm-system co-design that addresses this bottleneck. Offline, correlation-aware clustering places frequently co-activated neurons contiguously in flash; the placement is formulated as a Hamiltonian-path problem and solved with a greedy heuristic. Online, IOPS-friendly access collapse and linking-aligned caching further increase read continuity. Across multiple devices and models, Neuralink improves I/O bandwidth by $1.80\times$ and end-to-end latency by $1.49\times$ on average over state-of-the-art baselines. The authors present it as the first approach to optimize storage placement under activation sparsity, offering a practical route to more efficient on-device LLM inference through cross-layer algorithm-system co-design.
Abstract
Large Language Models (LLMs) have achieved remarkable success across various domains, yet deploying them on mobile devices remains an arduous challenge due to their extensive computational and memory demands. While lightweight LLMs have been developed to fit mobile environments, they suffer from degraded model accuracy. In contrast, sparsity-based techniques minimize DRAM usage by selectively transferring only relevant neurons to DRAM while retaining the full model in external storage, such as flash. However, such approaches are critically limited by numerous I/O operations, particularly on smartphones with severe IOPS constraints. In this paper, we propose Neuralink, a novel approach that accelerates LLM inference on smartphones by optimizing neuron placement in flash memory. Neuralink leverages the concept of Neuron Co-Activation, where neurons frequently activated together are linked to facilitate continuous read access and optimize I/O efficiency. Our approach incorporates a two-stage solution: an offline stage that reorganizes neuron placement based on co-activation patterns, and an online stage that employs tailored data access and caching strategies to align well with hardware characteristics. Evaluations conducted on a variety of smartphones and LLMs demonstrate that Neuralink achieves on average $1.49\times$ improvements in end-to-end latency compared to the state-of-the-art. As the first solution to optimize storage placement under sparsity, Neuralink explores a new optimization space at the intersection of sparsity-driven algorithm and storage-level system co-design for LLM inference.
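To make the offline stage concrete, the following is a minimal sketch of co-activation-driven placement, not the authors' implementation. It assumes activation traces are available as lists of neuron indices, scores each neuron pair by how often the two fire together, and then builds a path with a generic edge-greedy heuristic (link the heaviest remaining pair whenever both neurons are still path endpoints). The heuristic and the names `coactivation_counts`, `greedy_placement`, and `count_reads` are illustrative assumptions; the summary above states only that the placement is treated as a Hamiltonian-path problem and solved greedily.

```python
from collections import defaultdict
from itertools import combinations


def coactivation_counts(traces):
    """Count how often each unordered neuron pair is active in the same trace."""
    counts = defaultdict(int)
    for active in traces:
        for a, b in combinations(sorted(set(active)), 2):
            counts[(a, b)] += 1
    return counts


def greedy_placement(num_neurons, counts):
    """Return a flash order of neurons built by edge-greedy path merging.

    Assumption: this is a generic greedy heuristic for a heavy Hamiltonian
    path, used here for illustration only.
    """
    degree = [0] * num_neurons            # links already attached to each neuron
    neighbors = [[] for _ in range(num_neurons)]
    parent = list(range(num_neurons))     # union-find, prevents cycles

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Visit pairs from most to least frequently co-activated.
    for (a, b), _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            parent[find(a)] = find(b)
            neighbors[a].append(b)
            neighbors[b].append(a)
            degree[a] += 1
            degree[b] += 1

    # Each fragment is a simple path; walk it from an endpoint and concatenate.
    order, seen = [], [False] * num_neurons
    for start in range(num_neurons):
        if seen[start] or degree[start] == 2:
            continue
        prev, cur = None, start
        while cur is not None:
            seen[cur] = True
            order.append(cur)
            nxt = [n for n in neighbors[cur] if n != prev]
            prev, cur = cur, (nxt[0] if nxt else None)
    return order


def count_reads(order, active):
    """Number of contiguous runs (separate reads) needed to fetch `active`."""
    pos = {n: i for i, n in enumerate(order)}
    slots = sorted(pos[n] for n in set(active))
    return sum(1 for i, s in enumerate(slots) if i == 0 or s != slots[i - 1] + 1)


if __name__ == "__main__":
    # Toy traces (made up for illustration): each list is one token's active neurons.
    traces = [[0, 3, 5], [0, 3, 5], [1, 4], [1, 4], [0, 3], [2]]
    order = greedy_placement(6, coactivation_counts(traces))
    print("placement:", order)
    print("reads, default order:", count_reads(list(range(6)), [0, 3, 5]))
    print("reads, linked order: ", count_reads(order, [0, 3, 5]))
```

In this toy setting, neurons that fire together end up adjacent, so one activated set can be fetched with fewer, longer reads. A real pipeline would also need to handle neuron sizes, layer boundaries, and storage block alignment, none of which this sketch models.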
