Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models

Jialuo He, Huangxun Chen

TL;DR

E-AdaPrune is an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. It allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters.

Abstract

Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
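
The energy criterion described in the abstract can be illustrated with a minimal sketch. It assumes that "spectral energy" means the sum of squared singular values and that the budget is the smallest rank whose leading singular values retain a fraction τ of that energy; the paper's exact definition may differ. The function name `adaptive_budget` and the `use_squared` switch are hypothetical, introduced only for illustration. The singular values are taken as input here; in practice they would come from a (randomized) SVD of the visual feature matrix.

```python
from itertools import accumulate


def adaptive_budget(singular_values, tau=0.99, use_squared=True):
    """Return the smallest k whose top-k singular values keep a fraction tau of the energy.

    Assumption: energy is the sum of squared singular values (use_squared=True).
    Set use_squared=False to use the plain sum of singular values instead.
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")

    # Sort in descending order so the leading components come first.
    s = sorted((float(v) for v in singular_values), reverse=True)
    energy = [v * v if use_squared else v for v in s]
    total = sum(energy)
    if total == 0.0:
        return 0  # degenerate input: nothing to preserve

    # Small tolerance guards against floating-point round-off at the threshold.
    target = tau * total - 1e-12 * total
    for k, cumulative in enumerate(accumulate(energy), start=1):
        if cumulative >= target:
            return k
    return len(s)


# A sharply decaying spectrum needs few components; a flat one needs many.
sharp_budget = adaptive_budget([10, 1, 1, 1], tau=0.9)
flat_budget = adaptive_budget([1, 1, 1, 1], tau=0.9)
```

With these toy spectra, the sharply decaying one is covered by a single component while the flat one needs all four, which mirrors the intended behaviour: redundant images get small budgets and information-dense images get large ones.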

Paper Structure (24 sections, 6 equations, 5 figures, 6 tables, 2 algorithms)

Figures (5)

  • Figure 1: Different images contain different amounts of visual information, indicating that a static token budget may either discard critical details or retain unnecessary redundancy. All examples are from TextVQA.
  • Figure 2: VLM forward process architecture. Circled numbers indicate viable locations for visual token pruning: ① the vision-LLM interface and ② intermediate LLM layers.
  • Figure 3: Comparison of static and adaptive pruning. (a) FastV uses a fixed top-$k$ budget regardless of image content. (b) E-AdaPrune determines a content-aware budget $k^*$ via an image-specific energy criterion, optimizing token retention for varying information densities. $V$ and $R$ denote token importance scores and rankings.
  • Figure 4: Singular value spectra comparison. Simple image (a) shows sharp decay with $k^{*}=95$. Hard image (b) has a flat spectrum requiring $k^{*}=259$. Red dashed lines indicate the adaptive rank at $\tau=99\%$.
  • Figure 5: TextVQA visualization ($\tau=99.0\%$). Compared to static FastV (bottom), FastV+E (middle) adaptively allocates tokens—retaining more for dense scenes to ensure accuracy and fewer for simple scenes to improve efficiency. Green and red indicate correct and incorrect responses, respectively.
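
The contrast drawn in Figure 3 between a fixed top-$k$ budget and a content-aware budget $k^*$ can be sketched as follows. This is only an illustration under stated assumptions: the importance scores are taken as given (how they are computed is not specified here), the helper name `top_k_indices` is hypothetical, and the per-image budgets are hard-coded toy values rather than outputs of the energy criterion.

```python
def top_k_indices(scores, k):
    """Return the positions of the k highest-scoring tokens, in original token order."""
    k = max(0, min(k, len(scores)))  # clamp the budget to the valid range
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore positional order for the kept tokens


# Toy importance scores for six visual tokens (illustrative values only).
scores = [0.1, 0.9, 0.3, 0.7, 0.2, 0.5]

# Static pruning: the same budget for every image.
static_keep = top_k_indices(scores, k=3)

# Adaptive pruning: the budget varies per image (values assumed for illustration).
simple_image_keep = top_k_indices(scores, k=2)  # redundant scene, small budget
dense_image_keep = top_k_indices(scores, k=4)   # information-dense scene, larger budget
```

Only the budget changes between the two modes in this sketch; the ranking step is shared, which is consistent with the figure's description of $k^*$ being plugged into an existing score-based pruner.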