Balancing Saliency and Coverage: Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs

Jaehoon Lee, Mingi Jung, Soohyuk Jang, Seungryong Yoo, Dahuin Jung, Sungroh Yoon

Abstract

Large Vision-Language Models (VLMs) achieve strong multimodal understanding capabilities by leveraging high-resolution visual inputs, but the resulting large number of visual tokens creates a major computational bottleneck. Recent work mitigates this issue through visual token compression, typically compressing tokens based on saliency, diversity, or a fixed combination of both. We observe that the distribution of semantic prominence varies substantially across samples, leading to different optimal trade-offs between local saliency preservation and global coverage. This observation suggests that applying a static compression strategy across all samples can be suboptimal. Motivated by this insight, we propose PromPrune, a sample-adaptive visual token selection framework composed of semantic prominence-aware budget allocation and a two-stage selection pipeline. Our method adaptively balances local saliency preservation and global coverage according to the semantic prominence distribution of each sample. By allocating token budgets between locally salient regions and globally diverse regions, our method maintains strong performance even under high compression ratios. On LLaVA-NeXT-7B, our approach reduces FLOPs by 88% and prefill latency by 22% while preserving 97.5% of the original accuracy.
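The two-stage selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `two_stage_select`, the per-token `saliency` input, and the use of greedy farthest-point selection under cosine distance for the coverage stage are all assumptions made for illustration. The abstract only states that locally salient tokens are kept first and that diverse tokens are then drawn from the remaining pool.

```python
import numpy as np


def two_stage_select(tokens, saliency, k_sal, k_cov):
    """Return sorted indices of the kept visual tokens (illustrative sketch).

    tokens:   (N, D) array of visual token features.
    saliency: (N,) array of per-token saliency scores (hypothetical input).
    k_sal:    budget for locally salient tokens.
    k_cov:    budget for globally diverse (coverage) tokens.
    """
    tokens = np.asarray(tokens, dtype=float)
    saliency = np.asarray(saliency, dtype=float)
    n = tokens.shape[0]
    k_sal = min(k_sal, n)
    k_cov = min(k_cov, n - k_sal)

    # Stage 1: keep the k_sal tokens with the highest saliency scores.
    order = np.argsort(-saliency, kind="stable")
    selected = [int(i) for i in order[:k_sal]]

    # Unit-normalize features so dot products are cosine similarities.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, 1e-12)

    remaining = np.ones(n, dtype=bool)
    if selected:
        remaining[selected] = False
        # Cosine distance from every token to its nearest selected token.
        min_dist = 1.0 - (unit @ unit[selected].T).max(axis=1)
    else:
        min_dist = np.full(n, np.inf)

    # Stage 2 (assumed technique): greedy farthest-point selection over
    # the unselected pool, so each pick is the token least similar to
    # everything kept so far.
    for _ in range(k_cov):
        candidates = np.where(remaining, min_dist, -np.inf)
        j = int(np.argmax(candidates))
        selected.append(j)
        remaining[j] = False
        min_dist = np.minimum(min_dist, 1.0 - unit @ unit[j])

    return np.sort(np.array(selected, dtype=int))
```

For example, `two_stage_select(tokens, saliency, k_sal=2, k_cov=2)` keeps the two highest-scoring tokens and then adds the two remaining tokens that are farthest, in cosine distance, from everything already kept. How the split between `k_sal` and `k_cov` is chosen per sample is the budget allocation step, which this sketch takes as given.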

Paper Structure (44 sections, 19 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Saliency–Coverage Trade-Off across two distinct benchmarks. (a) Performance trends of POPE and SQA under different saliency–coverage allocations with a fixed token budget. (b) Example samples illustrating the contrasting semantic prominence distributions of the two benchmarks.
  • Figure 2: Sensitivity of entropy metrics to semantic prominence distribution. Mean normalized entropy values across semantic prominence distribution levels for feature norm entropy (Norm), attention entropy (Attn), and singular value spectral entropy (Spectral).
  • Figure 3: Overview of the proposed PromPrune framework. Given an input image, the projected visual tokens are first processed by entropy-guided budget allocation to determine the saliency and coverage budgets. A two-stage token selection pipeline then retains locally salient tokens and selects diverse tokens from the unselected token pool to ensure global coverage before passing the compressed tokens to the LLM.
  • Figure 4: Comparison of proxy metrics for estimating the distribution of semantic prominence. We compare spectral entropy with two alternative proxy metrics, feature norm entropy and attention entropy, on the MME, POPE, and GQA benchmarks.
  • Figure 5: Robustness of Sigmoidal Mapping Hyperparameters. We vary $\mu$ and $\tau$ in the proposed allocation function and evaluate the resulting performance on the SQA, POPE, and GQA benchmarks.
  • ...and 5 more figures
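
The captions for Figures 2, 3, and 5 mention singular value spectral entropy, entropy-guided budget allocation, and a sigmoidal mapping with hyperparameters $\mu$ and $\tau$. The sketch below shows one way these pieces could fit together. It is an illustration under stated assumptions, not the paper's formulation: the function names `spectral_entropy` and `allocate_budget`, the normalization of the singular values, the default values of `mu` and `tau`, and the direction of the mapping (higher entropy shifts budget toward coverage) are all assumed.

```python
import math

import numpy as np


def spectral_entropy(tokens, eps=1e-12):
    """Normalized Shannon entropy of the singular value spectrum, in [0, 1].

    Assumption: singular values are normalized to sum to 1 and treated as
    a probability distribution. The paper may normalize differently.
    """
    s = np.linalg.svd(np.asarray(tokens, dtype=float), compute_uv=False)
    if len(s) < 2:
        return 0.0
    p = s / max(float(s.sum()), eps)
    p = p[p > eps]  # drop numerically zero components before taking logs
    h = -float((p * np.log(p)).sum())
    return h / math.log(len(s))


def allocate_budget(entropy, total_budget, mu=0.5, tau=0.1):
    """Split total_budget into (k_sal, k_cov) with a sigmoid of the entropy.

    Assumption: mu is the midpoint and tau the temperature of the sigmoid,
    and higher entropy (prominence spread across the image) gives a
    larger coverage share. Defaults are placeholders, not tuned values.
    """
    coverage_ratio = 1.0 / (1.0 + math.exp(-(entropy - mu) / tau))
    k_cov = int(round(coverage_ratio * total_budget))
    k_sal = total_budget - k_cov
    return k_sal, k_cov
```

Under these assumptions, a sample whose token features are dominated by a single direction gets an entropy near 0 and nearly the whole budget goes to salient tokens, while a sample with a flat spectrum gets an entropy near 1 and most of the budget goes to coverage.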