CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models

Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen

TL;DR

CoreMatching tackles the high inference cost of Vision-Language Models by revealing and exploiting a mutual relationship between token sparsity and neuron sparsity. It introduces core neurons and core tokens, and a co-adaptive, training-free framework that prunes both dimensions during pre-filling and decoding. A projection-guided criterion links token importance to both attention and angular information, with theoretical support based on orthogonality assumptions and neuron intersections. Empirically, CoreMatching delivers substantial hardware speedups (e.g., 2.1x pre-fill, 9.2x decoding) and near-lossless accuracy across image, video, and LVLM benchmarks, demonstrating strong potential for resource-constrained deployment.

Abstract

Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions to enhance efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: Do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we found that key neurons and tokens for inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework, which leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieved 5x FLOPs reduction and a 10x overall speedup. Code is released at https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main.

Paper Structure

This paper contains 44 sections, 19 equations, 17 figures, 8 tables, 1 algorithm.

Figures (17)

  • Figure 1: Schematic diagram of CoreMatching. In the pre-filling stage, CoreMatching computes the Core Neurons in the FFN block from the activations; Core Neurons are the most frequently activated group of neurons. CoreMatching then matches the neurons activated by each token against the Core Neurons and selects the group of tokens with the largest intersection as the Core Tokens. Only the Core Tokens are passed to the subsequent layers. During the decoding stage, the model computes with the Core Neurons only, and the KV cache contains only Core Tokens. CoreMatching thus achieves comprehensive acceleration of VLM inference.
  • Figure 2: Verification of the predictability of core neurons. We visualize the core neurons of the 25th layer of Llava-1.5-7b for input text of different token lengths, with $\rho = 0.2, \beta=0.4$. Only the first 256 neurons are shown. Once the input carries sufficient semantics, the core neurons remain almost unchanged.
  • Figure 3: (Upper) Distribution of $\bigl|\Gamma(x) \,\cap\, \mathcal{C}_\rho^\beta(s)\bigr|$ over image tokens. The experiment was conducted on the 10th layer of Llava-1.5-7b. The input image is the rabbit on the left, and the input text is the text shown above the image; red font marks the key words of the text. (Note that since the core neurons themselves account for 40% of all neurons, an intersection of about 2000 is what random sampling would give.) (Lower) Core tokens under different inputs. The left panel is a schematic of the maximum geometric distance method used to select the threshold. The right panel shows the core tokens retained under the distribution for the corresponding image above.
  • Figure 4: Diagram of attention score and projection value. ✓ indicates that the token is retained under this metric; ✗ indicates that it is discarded.
  • Figure 5: Comparison of three metrics. The input is the rabbit image in Fig. 3 and the question “What color clothes is the rabbit wearing?”. The experiment is conducted on the 10th layer of Llava-1.5-7b.
  • ...and 12 more figures
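
The matching step described in the captions of Figures 1 and 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several definitions are assumptions that this excerpt does not confirm: a token's strongest neurons are taken as the top $\rho$ fraction of its activations, $\Gamma(x)$ is taken as the set of neurons with positive activation, and the "maximum geometric distance" threshold is read as the point on the sorted score curve farthest from the chord joining its endpoints. The function names `core_neurons`, `intersection_scores`, and `knee_threshold` are hypothetical.

```python
import numpy as np


def core_neurons(acts, rho=0.2, beta=0.4):
    """Boolean mask of core neurons for one FFN layer.

    acts: array of shape (num_tokens, num_neurons).
    Assumption: each token votes for its top-rho fraction of neurons, and the
    beta fraction of neurons with the most votes forms the core set.
    """
    n_tokens, n_neurons = acts.shape
    k_tok = max(1, int(rho * n_neurons))
    # Indices of the k_tok largest activations of every token.
    top_idx = np.argpartition(-acts, k_tok - 1, axis=1)[:, :k_tok]
    votes = np.zeros((n_tokens, n_neurons), dtype=bool)
    np.put_along_axis(votes, top_idx, True, axis=1)
    counts = votes.sum(axis=0)
    k_core = max(1, int(beta * n_neurons))
    core = np.zeros(n_neurons, dtype=bool)
    core[np.argsort(-counts, kind="stable")[:k_core]] = True
    return core


def intersection_scores(acts, core):
    """|Gamma(x) ∩ C| per token, with Gamma(x) assumed to be acts > 0."""
    return ((acts > 0) & core).sum(axis=1)


def knee_threshold(scores):
    """Score at the point of the descending score curve farthest from the
    chord between its first and last points (an assumed reading of the
    maximum geometric distance rule)."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]
    n = len(s)
    if n < 3 or s[0] == s[-1]:
        return float(s[-1])  # degenerate curve: keep every token
    x = np.arange(n, dtype=float)
    dx, dy = float(n - 1), s[-1] - s[0]
    dist = np.abs(dy * x - dx * (s - s[0])) / np.hypot(dx, dy)
    return float(s[np.argmax(dist)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.standard_normal((576, 1024))  # synthetic activations
    core = core_neurons(acts, rho=0.2, beta=0.4)
    scores = intersection_scores(acts, core)
    keep = scores >= knee_threshold(scores)  # candidate core tokens
    print(int(core.sum()), int(keep.sum()))
```

In this sketch, tokens whose score reaches the knee are kept; whether the boundary token is kept or dropped is a design choice that the captions leave open.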