VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Haitao Mi, Dong Yu

TL;DR

This work investigates the high computational cost of processing dense visual token sequences in large vision-language models (LVLMs) and proposes VScan, a two-stage, training-free token reduction framework. Its empirical analysis shows that visual tokens evolve from local details to global context across encoder layers and that cross-modal signals stabilize in the middle-to-late LLM layers, which motivates complementary global/local scanning and middle-layer pruning. VScan combines a global/local visual scan with token merging (stage one) and middle-layer pruning guided by text relevance (stage two), achieving substantial inference speedups (e.g., 2.91× on LLaVA-NeXT-7B prefilling) with minimal accuracy loss (often >95% of original performance) across 16 benchmarks and 4 LVLMs. The method generalizes across diverse backbones and is compatible with FlashAttention, offering a practical, training-free path to deploying efficient LVLMs in real-time settings.

Abstract

Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91× speedup in prefilling and a 10× reduction in FLOPs, while retaining 95.4% of the original performance. Code is available at https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan.
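The two stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the non-overlapping window layout, the cosine-similarity merge rule, and the mean-over-text-tokens relevance score are all assumptions chosen for illustration, since the abstract only states that global and local scans are combined with token merging and that pruning happens at intermediate LLM layers.

```python
import numpy as np

def global_scan(cls_attn, k):
    """Indices of the k tokens with the highest [CLS] attention (assumed score)."""
    return np.argsort(-cls_attn)[:k]

def local_scan(cls_attn, grid, window, k_per_window):
    """Top tokens inside each non-overlapping window of a grid x grid layout."""
    attn2d = cls_attn.reshape(grid, grid)
    picked = []
    for r in range(0, grid, window):
        for c in range(0, grid, window):
            block = attn2d[r:r + window, c:c + window]
            flat = np.argsort(-block.ravel())[:k_per_window]
            rows, cols = np.unravel_index(flat, block.shape)
            # Map window-local coordinates back to flat token indices.
            picked.extend(((r + rows) * grid + (c + cols)).tolist())
    return np.array(picked, dtype=int)

def merge_dropped(tokens, keep):
    """Average each dropped token into its most cosine-similar kept token."""
    keep = np.unique(keep)
    drop = np.setdiff1d(np.arange(tokens.shape[0]), keep)
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    merged = tokens[keep].copy()
    counts = np.ones(len(keep))
    if drop.size:
        nearest = (unit[drop] @ unit[keep].T).argmax(axis=1)
        np.add.at(merged, nearest, tokens[drop])
        np.add.at(counts, nearest, 1.0)
    return merged / counts[:, None], keep

def prune_by_text(visual, text_to_visual_attn, k):
    """Stage two (assumed form): keep the k visual tokens most attended by text."""
    score = text_to_visual_attn.mean(axis=0)
    keep = np.sort(np.argsort(-score)[:k])  # preserve original token order
    return visual[keep], keep

# Toy run on random data: an 8 x 8 token grid with 16-dimensional features.
rng = np.random.default_rng(0)
grid, dim = 8, 16
tokens = rng.normal(size=(grid * grid, dim))
cls_attn = rng.random(grid * grid)

# Stage one: union of global and local selections, then merge the rest.
keep = np.union1d(global_scan(cls_attn, 8), local_scan(cls_attn, grid, 4, 2))
merged, keep = merge_dropped(tokens, keep)

# Stage two: prune again using stand-in text-to-visual attention weights.
text_attn = rng.random((5, merged.shape[0]))
pruned, kept = prune_by_text(merged, text_attn, 6)
```

In a real model the attention scores would come from the visual encoder and from an intermediate LLM layer rather than from random numbers, and the retention budgets (here 8, 2 per window, and 6) would be set by the target reduction rate.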

Paper Structure

This paper contains 27 sections, 6 equations, 7 figures, 18 tables.

Figures (7)

  • Figure 1: Comparison of our VScan with representative text-agnostic approaches (e.g., VisionZip (Yang et al., 2024)) and text-aware approaches (e.g., FastV (Chen et al., 2024)). In this work, we introduce VScan, a two-stage, training-free visual token reduction framework that can be seamlessly applied to various open-source LVLM architectures, delivering significant inference acceleration with minimal performance loss.
  • Figure 2: Empirical study on visual redundancy reduction. (Left) We illustrate two failure cases where relying solely on the output [CLS] attention leads to incorrect predictions. For comparison, we include reference token selections from CLIP-ViT-L-336px (Radford et al., 2021), following Gandelsman et al. (2024), which highlight regions of interest relevant to the text query. (Right) We visualize the [CLS] attention maps and self-attention maps of representative tokens (e.g., #536: ground, #234: person) across different encoding layers, illustrating how attention patterns evolve from localized focus in shallow layers to broader global context in deeper layers.
  • Figure 3: (Left) Study 1: Distribution of retained tokens at a 50% reduction rate in layers 2, 8, and 16 of LLaVA-1.5-7B on POPE (Li et al., 2023); (Right) Study 2: Sum of visual attention across different attention heads and LLM layers using LLaVA-1.5-7B and Qwen-2.5-VL-7B on POPE (Li et al., 2023).
  • Figure 4: Study 3: Visualization of next-token predictions derived from the output hidden states of each LLM layer using (a) LLaVA-1.5-7B; (b) LLaVA-NeXT-7B. Darker colors indicate higher prediction confidence.
  • Figure 5: Performance comparisons on Qwen-2.5-VL (Bai et al., 2025) with different LLM sizes (3B/7B/32B) across three image understanding benchmarks. We present the performance of different approaches at four retention rates, along with the original model performance without token reduction.
  • ...and 2 more figures