Attention Debiasing for Token Pruning in Vision Language Models

Kai Zhao, Wubang Yuan, Yuchen Lin, Liting Ruan, Xiaofeng Lu, Deng-Ping Fan, Ming-Ming Cheng, Dan Zeng

TL;DR

Large visual token counts make vision-language models computationally expensive, and attention-based token pruning is a common remedy. This work shows that the attention scores used for pruning are biased by recency and by padding attention sinks, both inherited from LLMs. It introduces two training-free debiasing techniques: positional debiasing, which uses an exponential trend model to obtain a content-aware, position-agnostic score, and padding attention suppression, which zeroes out attention on padding tokens. Together they form a lightweight, plug-and-play module for existing pruning methods that yields consistent gains on ten image benchmarks and three video benchmarks across multiple VLM architectures, with negligible overhead. An accompanying analysis of attention biases shows that debiasing reduces bias strength to near zero and that removing RoPE alone does not resolve these biases.

Abstract

Vision-language models (VLMs) typically encode substantially more visual tokens than text tokens, resulting in significant token redundancy. Pruning uninformative visual tokens is therefore crucial for improving computational efficiency, and language-to-vision attention has become a widely used importance criterion for this purpose. However, we find that attention in VLMs is systematically biased. It disproportionately favors tokens appearing later in the sequence, manifesting as over-attention to lower image regions, and assigns inflated scores to semantically empty padding tokens. These behaviors stem from intrinsic recency bias and attention sink effects inherited from large language models (LLMs), and they distort attention-based pruning by preserving irrelevant visual content. To derive a pruning criterion better aligned with semantic relevance, we introduce two lightweight yet effective debiasing techniques that restore the reliability of attention. The first compensates for positional distortions by removing recency-induced attention trends, producing a content-aware and position-agnostic importance measure. The second suppresses attention sink effects by eliminating spurious attention on padding tokens. Our method is model-agnostic, pruning-method-agnostic, and task-agnostic, enabling plug-and-play integration with existing VLM pruning models. Despite its simplicity, our approach consistently delivers strong performance gains. We evaluate our method on ten vision-language benchmarks spanning both image-based and video-based tasks, in comparison with seven state-of-the-art visual token pruning methods and across two representative VLM architectures. Our method achieves substantial performance gains, demonstrating strong effectiveness and generalizability. Our code is available at https://github.com/intcomp/attention-bias.
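
The two debiasing steps described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that) and rests on several assumptions: the recency trend is modeled as `a * exp(b * position)` and fitted by least squares in log space, the trend is removed by division rather than subtraction, and the function names `fit_exponential_trend`, `debias_scores`, and `select_tokens` are hypothetical.

```python
import numpy as np


def fit_exponential_trend(mean_attn):
    """Fit a * exp(b * pos) to a per-position attention curve.

    Assumption: the trend is estimated by linear least squares on
    log-attention; the paper's exact fitting procedure may differ.
    """
    pos = np.arange(mean_attn.shape[0], dtype=np.float64)
    log_attn = np.log(np.clip(mean_attn, 1e-12, None))
    # np.polyfit returns coefficients from highest degree to lowest.
    b, log_a = np.polyfit(pos, log_attn, deg=1)
    return np.exp(log_a + b * pos)


def debias_scores(attn, trend, padding_mask):
    """Remove the positional trend, then suppress padding tokens.

    Assumption: the trend is divided out; padding tokens get score 0.
    """
    scores = attn / trend
    return np.where(padding_mask, 0.0, scores)


def select_tokens(scores, keep):
    """Return the indices of the `keep` highest-scoring tokens, in order."""
    top = np.argsort(-scores, kind="stable")[:keep]
    return np.sort(top)
```

In this sketch, `attn` would be the text-to-vision attention for one sample (one score per visual token), `trend` would be fitted once on attention averaged over a calibration set, and `padding_mask` would mark the visual tokens that come from padded image regions. Because the debiased scores replace the raw attention only at the ranking step, the same functions could sit in front of any attention-based pruning rule.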

Paper Structure

This paper contains 24 sections, 8 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Average performance of multiple VLMs across ten image-based vision–language QA benchmarks, where each vertex corresponds to a pruning method. Our method consistently improves performance across all six pruning methods.
  • Figure 2: The average text-to-vision attention scores in LLaVA-v1.5-7B [liu2023visual] before (left) and after (right) applying our debiasing techniques. The original attention scores exhibit a strong recency bias, favoring visual tokens from lower image regions.
  • Figure 3: Qualitative visualization of visual token pruning results for FastV, PyramidDrop, SparseVLM, and HiMAP. Retained visual tokens are overlaid on the input images for each method before and after applying our approach. Existing pruning methods tend to preserve tokens in padded or bottom regions while discarding fine-grained, question-relevant patches. In contrast, our method suppresses the retention of padded regions and consistently preserves semantically important visual tokens, leading to more accurate predictions.
  • Figure 4: Visualization of visual token selection frequency across different image aspect ratios (square, portrait, and landscape) on TextVQA. Brighter regions denote higher token selection frequency. FastV shows strong bias toward padded or lower image regions in portrait and landscape inputs, where outlier tokens are selected disproportionately often, whereas our method yields a more balanced and spatially uniform selection pattern.
  • Figure 5: Text-to-vision attention trends in three settings: original attention, average-attention debiasing using \ref{eq:naive-bias}, and fitted exponential trend debiasing using \ref{eq:fit-bias}. Directly using the average attention leaves clear positional bias, while exponential fitting yields a smoother, position-agnostic attention trend.
  • ...and 2 more figures