ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models

Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong

TL;DR

This work proposes ZOO-Prune, a training-free framework built on the intuition that highly sensitive tokens have a stronger influence on the model's output and capture complementary visual cues rather than redundant ones. It estimates token sensitivity using zeroth-order perturbations at the lightweight projection layer.

Abstract

Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely on raw attention scores, which are often unstable across layers and heads and can lead to redundant selections. Diversity-based methods improve robustness by selecting tokens far apart in feature space, but risk dropping regions needed for accurate prediction. We propose ZOO-Prune, a training-free framework built on the intuition that highly sensitive tokens have a stronger influence on the model's output and capture complementary visual cues rather than redundant ones. To achieve this, we estimate token sensitivity using zeroth-order perturbations at the lightweight projection layer. This measures how small random perturbations affect the projected features and enables efficient approximation of each token's influence without backpropagation. Extensive experiments across multiple VLMs and benchmarks show that ZOO-Prune consistently outperforms prior methods while pruning up to 94.4% of tokens without sacrificing accuracy. Our method also improves efficiency, reaching up to 2.30x faster end-to-end inference compared to the baseline.
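
As a rough illustration of the zeroth-order idea in the abstract, the sketch below scores each token by how much a projection output changes under small random Gaussian perturbations, using only forward evaluations. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the central-difference form, the Euclidean norm, the plain average over directions, and the toy `tanh` projection are all hypothetical choices made for illustration.

```python
import math
import random


def zeroth_order_sensitivity(project, tokens, h=1e-2, m=8, seed=0):
    """Estimate a per-token sensitivity score without backpropagation.

    project: callable mapping one token (list of floats) to its projected
             features (list of floats); a stand-in for the projection layer.
    tokens:  list of tokens, each a list of floats.
    h:       small step size for the perturbation.
    m:       number of random Gaussian directions per token.
    """
    rng = random.Random(seed)
    scores = []
    for x in tokens:
        total = 0.0
        for _ in range(m):
            # Isotropic Gaussian direction u with the same dimension as x.
            u = [rng.gauss(0.0, 1.0) for _ in x]
            plus = project([xi + h * ui for xi, ui in zip(x, u)])
            minus = project([xi - h * ui for xi, ui in zip(x, u)])
            # Central finite difference (assumed form): ||M(x+hu) - M(x-hu)|| / (2h).
            diff = math.sqrt(sum((p - q) ** 2 for p, q in zip(plus, minus)))
            total += diff / (2.0 * h)
        scores.append(total / m)
    return scores


def toy_project(x):
    """Hypothetical nonlinear projection used only to exercise the sketch."""
    s = sum(x)
    return [math.tanh(s), math.tanh(0.5 * s)]


if __name__ == "__main__":
    tokens = [[0.0, 0.0, 0.0, 0.0], [50.0, 50.0, 50.0, 50.0]]
    print(zeroth_order_sensitivity(toy_project, tokens, m=16))
```

In this toy setup the first token sits where `tanh` is steep, so perturbing it changes the output noticeably, while the second token sits in the saturated region and receives a score near zero. Each token costs `2 * m` forward passes through the projection and no gradient computation.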

Paper Structure

This paper contains 42 sections, 2 theorems, 11 equations, 14 figures, 8 tables, 1 algorithm.

Key Result

Proposition 3.1

Let $M:\!\mathbb{R}^n \!\to\! \mathbb{R}^m$ be differentiable at $x \!\in\! \mathbb{R}^n$ with Jacobian $J(x) \!=\! \nabla M(x)$. Let $u \sim \mathcal{N}(0, I_n)$ be an isotropic Gaussian perturbation and $h \!>\! 0$ a small step size. Define the finite-difference sensitivity $S(x)\!=\!\mathbb{E}_u\

Figures (14)

  • Figure 1: Illustration of training-free VLM token pruning methods. (a) Attention-based methods select tokens using attention scores, but often retain redundant tokens. (b) Diversity-based methods select tokens with different features to maximize coverage but may lose tokens located in semantically relevant regions (e.g., around the monitor, highlighted in yellow). (c) Our method employs zeroth-order gradient estimation to quantify token sensitivity and integrates these scores into a diversity objective. (d) Accuracy comparison with LLaVA-NeXT-7B across 9 benchmarks, showing that ours outperforms both VisionZip (attention-based) and DivPrune (diversity-based).
  • Figure 2: Kernel density estimate (KDE) of Spearman rank correlations between token-importance rankings from the vision encoder and the projection layer on the MMMU and POPE datasets. The Spearman correlations are 0.55 on MMMU and 0.49 on POPE. The detailed settings are described in Appendix A.
  • Figure 2: Performance comparison on Qwen2.5-VL-7B.
  • Figure 3: Overview of ZOO-Prune. Given visual tokens from the vision encoder, we estimate token sensitivity via zeroth-order gradient approximation at the projection layer by adding Gaussian perturbations (i.e., $x_i \pm hu_j$). The resulting sensitivity scores are integrated with a diversity objective to form a hybrid score, guiding the selection. The selected subset is then passed to the LLM together with the text input, enabling efficient multimodal reasoning with reduced computation.
  • Figure 4: Hyperparameter sensitivity on POPE with LLaVA-1.5-7B: (a) small step size $h$, (b) number of perturbation directions $m$.
  • ...and 9 more figures
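
The Figure 3 caption says that sensitivity scores are integrated with a diversity objective to form a hybrid score that guides selection, but the exact form of that score is not given in this overview. The sketch below shows one plausible reading, purely as an assumption: a greedy max-min selection in which each candidate is scored by its sensitivity multiplied by its distance to the nearest already-selected token. The function name `select_tokens` and the product form of the score are hypothetical.

```python
import math


def select_tokens(features, sensitivity, k):
    """Greedily pick k token indices using a hypothetical hybrid score.

    features:    list of token feature vectors (lists of floats).
    sensitivity: list of non-negative per-token sensitivity scores.
    k:           number of tokens to keep.
    """
    n = len(features)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of tokens")

    # Seed the selection with the most sensitive token.
    selected = [max(range(n), key=lambda i: sensitivity[i])]
    # min_dist[i] is the distance from token i to its nearest selected token.
    min_dist = [math.dist(features[i], features[selected[0]]) for i in range(n)]

    while len(selected) < k:
        candidates = [i for i in range(n) if i not in selected]
        # Assumed hybrid score: sensitivity times distance to the selected set,
        # so a token must be both influential and non-redundant to rank high.
        best = max(candidates, key=lambda i: sensitivity[i] * min_dist[i])
        selected.append(best)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(features[i], features[best]))
    return selected


if __name__ == "__main__":
    feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
    sens = [1.0, 0.9, 0.8, 0.2]
    print(select_tokens(feats, sens, k=2))
```

With these toy inputs, token 1 is highly sensitive but nearly identical to token 0, so the second pick goes to the distant token 2 instead. A purely sensitivity-ranked selection would have kept the redundant pair.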

Theorems & Definitions (3)

  • Proposition 3.1: Approximated Mean Sensitivity
  • Proposition B.1: Approximated Mean Sensitivity
  • proof