A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs

Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, Yang You

TL;DR

The paper tackles the efficiency bottleneck of large vision-language models caused by excessive visual tokens. It introduces a training-free framework, SGL, that uses a small VLM to aggregate attention maps across all layers for precise visual-token ranking (SGP) and a complementary early exiting mechanism (SEE) to avoid invoking the large VLM when unnecessary. Across 11 benchmarks and multiple model sizes, SGL achieves up to 91% visual-token pruning while preserving competitive accuracy, demonstrating strong efficiency gains and broad generalizability. The approach offers a practical, training-free pathway to accelerate VLM inference without compromising cross-modal understanding.

Abstract

Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large VLM inference is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. However, our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning. However, obtaining the attention maps from all layers requires a full inference pass, which increases computational load and is therefore impractical in existing methods; and (iii) The global attention map aggregated from a small VLM closely resembles that of a large VLM, suggesting an efficient alternative. Based on these findings, we introduce a training-free method, Small VLM Guidance for accelerating Large VLMs (SGL). Specifically, we employ the attention map aggregated from a small VLM to guide visual token pruning in a large VLM. Additionally, an early exiting mechanism is developed to fully use the small VLM's predictions, dynamically invoking the larger VLM only when necessary, yielding a superior trade-off between accuracy and computation. Extensive evaluations across 11 benchmarks demonstrate the effectiveness and generalizability of SGL, achieving up to a 91% pruning ratio for visual tokens while retaining competitive performance.
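The two mechanisms described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the nested-list attention layout `attn[layer][head][query][key]`, the use of a plain sum for aggregation, and the threshold-based exit rule are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# Assumed layout: attn[layer][head][query][key] holds the small VLM's attention weights.
Attention = Sequence[Sequence[Sequence[Sequence[float]]]]


def aggregate_visual_scores(attn: Attention,
                            visual_idx: Sequence[int],
                            query_idx: Sequence[int]) -> List[float]:
    """Sum the attention each visual token receives over all layers, heads,
    and the given query positions (prompt and generated tokens)."""
    scores = [0.0] * len(visual_idx)
    for layer in attn:
        for head in layer:
            for q in query_idx:
                row = head[q]
                for pos, key in enumerate(visual_idx):
                    scores[pos] += row[key]
    return scores


def select_tokens(scores: Sequence[float],
                  visual_idx: Sequence[int],
                  retention_ratio: float) -> List[int]:
    """Return the positions of the highest-scoring visual tokens, in original order."""
    keep = max(1, int(len(visual_idx) * retention_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(visual_idx[i] for i in ranked[:keep])


def answer_with_early_exit(small_answer: str,
                           exit_score: float,
                           threshold: float,
                           run_large: Callable[[], str]) -> str:
    """Hypothetical exit rule: keep the small VLM's answer when its decision
    score reaches the threshold; otherwise invoke the large VLM."""
    if exit_score >= threshold:
        return small_answer
    return run_large()
```

In this sketch, `run_large` would wrap a call to the large VLM restricted to the positions returned by `select_tokens`, so the large model runs only when the small model's decision score falls below the threshold.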

Paper Structure

This paper contains 32 sections, 9 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: The motivation of our SGL. (a) A single-layer attention map is suboptimal compared to the global attention map aggregated from all layers. We take InternVL2 [chen2024far] at 2B and 26B as representative examples. FastV [chen2024image] prunes visual tokens using the attention map from a single layer, whereas FastV-oracle employs the attention map aggregated across all layers during inference. The latter allows precise pruning of less significant visual tokens, maintaining performance with only 9% of the tokens retained. (b) The small VLM exhibits a token retention pattern similar to that of the 26B model, preserving essential visual tokens relevant to the answer, regardless of answer correctness. We drop the 80% least significant visual tokens and mark the tokens with high attention scores. Thumbnails employed in InternVL2 [chen2024far] are presented in the left corner. (c) The performance gap between the small and large VLMs is minimal compared to their computation disparity. The 2B model achieves competitive performance with significantly fewer FLOPs than the 26B one. This also supports the soundness of using a small model to guide early exiting and token pruning in the large one.
  • Figure 2: Overview of SGL. (a) Small VLM-guided visual token pruning in a large VLM (SGP). We update a global attention map aggregated from all layers of a small VLM. This global attention map is used to rank visual tokens and guide the visual token pruning in a large VLM. (b) Aggregation of attention maps in SGP. We aggregate the attention scores that visual tokens receive from prompt tokens and generated tokens across all heads and layers in the small LM. Higher scores indicate greater significance. (c) Inference with Small VLM Early Exiting (SEE). When the early exiting decision score from the small VLM is sufficient, the larger VLM will not be invoked.
  • Figure 3: Performance-efficiency curves of SGL (SGP + SEE). The results with 18%, 35%, 50%, and 64% visual token retention ratios are presented as a curve. For the 26B and 40B models, we use an NVIDIA H20 GPU; the 76B model is sharded across two GPUs.
  • Figure 4: Comparison of different early-exiting decision scores. We present the area between each strategy's curve and the 2B model score alongside their names. A larger area indicates a more effective criterion. With the same early exiting ratio, a higher score reflects improved accuracy in identifying incorrect responses from the small VLM. Note that SGP is not adopted for clear comparison.
  • Figure 5: Visualization of SGP under different visual token retention ratios and answers. Visual tokens are pruned by 60%, 80%, and 95% at the 19th, 9th, and 2nd layers of the 26B large VLM, which comprises 48 layers. This results in average token retention ratios of 64%, 35%, and 9%, respectively. Retained tokens are highlighted. Thumbnails employed in InternVL are presented in the left corner.
  • ...and 3 more figures
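
The average retention ratios quoted in the Figure 5 caption can be reproduced with a short calculation. The sketch below assumes one plausible reading of "average": layers up to and including the pruning layer process all visual tokens, and every later layer processes only the surviving fraction. The function name and this averaging rule are illustrative assumptions, not taken from the paper.

```python
def average_retention(prune_layer: int, prune_fraction: float, num_layers: int) -> float:
    """Average fraction of visual tokens processed per layer.

    Assumes layers 1..prune_layer see every visual token and the remaining
    layers see only the (1 - prune_fraction) share that survives pruning.
    """
    kept = 1.0 - prune_fraction
    pruned_layers = num_layers - prune_layer
    return (prune_layer + pruned_layers * kept) / num_layers


# Settings from the Figure 5 caption: (pruning layer, pruned fraction), 48 layers.
for layer, fraction in [(19, 0.60), (9, 0.80), (2, 0.95)]:
    print(f"layer {layer}: {average_retention(layer, fraction, 48):.1%}")
```

Under this reading the three settings give (19 + 29 × 0.40) / 48 ≈ 63.8%, (9 + 39 × 0.20) / 48 = 35.0%, and (2 + 46 × 0.05) / 48 ≈ 9.0%, in line with the reported 64%, 35%, and 9%.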