Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models

Jialuo He; Huangxun Chen

TL;DR

E-AdaPrune is an energy-driven adaptive pruning framework that sets the token budget for each image from the singular value spectrum of the visual feature space. It allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters.

Abstract

Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a specified proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
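The energy criterion described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that spectral energy means the sum of squared singular values, and it uses an exact SVD from NumPy rather than the randomized SVD mentioned in the abstract. The function name `energy_budget` and the parameter name `tau` are illustrative.

```python
import numpy as np


def energy_budget(features: np.ndarray, tau: float = 0.99) -> int:
    """Return the smallest k whose top-k singular values hold a fraction tau of the energy.

    Assumption (not confirmed by the source): "energy" is the sum of
    squared singular values of the visual feature matrix.
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")

    # Singular values only, returned in descending order.
    sigma = np.linalg.svd(features, compute_uv=False)
    energy = sigma ** 2
    total = energy.sum()
    if total == 0.0:
        # Degenerate all-zero input: fall back to a minimal budget of one token.
        return 1

    cumulative = np.cumsum(energy) / total
    # Index of the first entry with cumulative energy >= tau, converted to a count.
    k = int(np.searchsorted(cumulative, tau, side="left")) + 1
    # Guard against floating-point round-off when tau is very close to 1.
    return min(k, sigma.size)
```

Under these assumptions, a feature matrix with a sharply decaying spectrum yields a small budget and a matrix with a flat spectrum yields a large one, which is the behavior the abstract describes.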

Paper Structure (24 sections, 6 equations, 5 figures, 6 tables, 2 algorithms)

This paper contains 24 sections, 6 equations, 5 figures, 6 tables, 2 algorithms.

Introduction
Related Work
Visual Token Redundancy in LVLMs
Fixed-Budget Token Reduction
Adaptive Reduction
Method
Preliminaries
Vision-Language Models.
Visual Token Redundancy.
E-AdaPrune: Energy-Based Adaptive Pruning
Singular Value Decomposition of $\textbf{Z}^V$.
Randomized SVD.
Experiment
Experiment Setup
Datasets.
...and 9 more sections

Figures (5)

Figure 1: Different images contain different amounts of visual information, indicating that a static token budget may either discard critical details or retain unnecessary redundancy. All examples are from TextVQA.
Figure 2: VLM forward process architecture. Circled numbers indicate viable locations for visual token pruning: ① the vision-LLM interface and ② intermediate LLM layers.
Figure 3: Comparison of static and adaptive pruning. (a) FastV uses a fixed top-$k$ budget regardless of image content. (b) E-AdaPrune determines a content-aware budget $k^*$ via an image-specific energy criterion, optimizing token retention for varying information densities. $V$ and $R$ denote token importance scores and rankings.
Figure 4: Singular value spectra comparison. Simple image (a) shows sharp decay with $k^{*}=95$. Hard image (b) has a flat spectrum requiring $k^{*}=259$. Red dashed lines indicate the adaptive rank at $\tau=99\%$.
Figure 5: TextVQA visualization ($\tau=99.0\%$). Compared to static FastV (bottom), FastV+E (middle) adaptively allocates tokens—retaining more for dense scenes to ensure accuracy and fewer for simple scenes to improve efficiency. Green and red indicate correct and incorrect responses, respectively.
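To illustrate how a content-aware budget $k^*$ could stand in for the fixed top-$k$ in Figure 3, the following sketch keeps the $k^*$ highest-scoring tokens. The helper `keep_top_tokens` and the choice to restore the original token order after selection are assumptions made for illustration, not details confirmed by the source. How the importance scores are computed is left outside the sketch.

```python
import numpy as np


def keep_top_tokens(tokens: np.ndarray, scores: np.ndarray, k: int):
    """Keep the k highest-scoring tokens, preserving their original order.

    tokens: (n, d) array of visual tokens.
    scores: (n,) array of importance scores (higher means more important).
    k:      token budget, e.g. an adaptive k* from an energy criterion.
    """
    n = scores.shape[0]
    # Clamp the budget to a valid range so that k = 0 or k > n cannot fail.
    k = max(1, min(int(k), n))
    # Rank tokens by descending score; a stable sort keeps ties deterministic.
    top = np.argsort(-scores, kind="stable")[:k]
    # Assumption: restore the original sequence order of the kept tokens.
    keep = np.sort(top)
    return tokens[keep], keep
```

With the budgets reported in Figure 4, a call with `k=95` would retain 95 tokens for the simple image and a call with `k=259` would retain 259 for the hard one, while a static method would use the same `k` for both.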
