AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

Changwoo Baek; Jouwon Song; Sohyeon Kim; Kyeongbo Kong

TL;DR

Incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. Attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features.

Abstract

Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate the substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, the characteristics and limitations of these approaches remain largely unexplored. In this work, we conduct a thorough empirical analysis, using effective rank (erank) as a measure of feature diversity and attention score entropy as a measure of attention concentration, to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) Attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page is available at https://cvsp-lab.github.io/AgilePruner.
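
The two diagnostics named in the abstract can be illustrated with a short sketch. This is not the authors' implementation: it assumes the commonly used definition of effective rank (the exponential of the Shannon entropy of the normalized singular values of the token feature matrix) and treats attention entropy as the Shannon entropy of attention scores normalized to sum to one. The function names `effective_rank` and `attention_entropy`, and the `eps` parameter, are hypothetical and chosen only for illustration.

```python
import numpy as np


def effective_rank(features, eps=1e-12):
    """Effective rank of a (num_tokens, dim) feature matrix.

    Assumed definition: exp of the Shannon entropy of the singular
    values after normalizing them to sum to one.
    """
    singular_values = np.linalg.svd(
        np.asarray(features, dtype=np.float64), compute_uv=False
    )
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries so log() is defined
    return float(np.exp(-(p * np.log(p)).sum()))


def attention_entropy(scores, eps=1e-12):
    """Shannon entropy of non-negative attention scores over visual tokens.

    Assumption: scores are renormalized to a probability distribution.
    """
    a = np.asarray(scores, dtype=np.float64)
    a = a / (a.sum() + eps)
    a = a[a > eps]  # drop numerically zero entries so log() is defined
    return float(-(a * np.log(a)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))  # toy stand-in for visual token features
    attn = rng.random(16)              # toy stand-in for attention scores
    print("erank:", effective_rank(tokens))
    print("attention entropy:", attention_entropy(attn))
```

Under these assumed definitions, a feature matrix whose tokens all point in one direction has an effective rank close to 1, and attention spread uniformly over N tokens has entropy log N. This matches the abstract's framing: concentrated attention gives low entropy, and distributed features give high erank.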

Paper Structure (49 sections, 9 equations, 18 figures, 18 tables)

Introduction
Related Works
Large Vision-Language Models
Visual token reduction
PRELIMINARIES
Visual token pruning.
Attention concentration via attention entropy.
Token embedding diversity via erank.
EMPIRICAL STUDIES
Empirical Analysis of Attention-Based and Diversity-Based Pruning
Analyzing Diversity Preservation in Existing Pruning Paradigms via erank
Quantitative Comparison of Diversity Preservation via erank
Relative Strengths of Diversity Mechanisms Across Methods
The Relationship between Pruning Methods and Hallucination
Object hallucination.
...and 34 more sections

Figures (18)

Figure 1: Overview of attention entropy and erank.
Figure 2: Response patterns of DivPrune (diversity-based) vs. FasterVLM (attention-based). DivPrune's responses are more comprehensive but risk hallucination, whereas FasterVLM produces safer, more focused descriptions. In the annotations, GT Obj. and Hallucinated Obj. label object words, DivPrune-specific phrasing is marked separately, and red text indicates incorrect phrases.
Figure 3: Diversity vs. attention in pruning across datasets and image complexities. (a) High-erank methods perform better on complex datasets (POPE), while low-erank methods excel on simple datasets (ScienceQA). (b) Simple images show low entropy and erank, leading to concentrated attention suitable for attention-based pruning. Complex images show high entropy and erank, where diversity-based pruning becomes more effective.
Figure 4: Effect of similarity threshold $\tau$ on token selection. A low (strict) $\tau$ prioritizes high-attention tokens, while a high (loose) $\tau$ increases the diversity of the selected tokens.
Figure 5: Attention entropy vs. erank on MME dataset.
...and 13 more figures
