PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding

Nan Wang, Zhiwei Jin, Chen Chen, Haonan Lu

Abstract

Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose an exceptionally heavy computational burden: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasted: across document and GUI benchmarks, only 22--71\% of image patches are pixel-unique; the rest are exact duplicates of another patch in the same image. We propose \textbf{PixelPrune}, which exploits this pixel-level redundancy through predictive-coding-based compression, pruning redundant patches \emph{before} the Vision Transformer (ViT) encoder. Because it operates in pixel space, prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training-free, requires no learnable parameters, and supports both pixel-lossless compression ($\tau{=}0$) and controlled lossy compression ($\tau{>}0$). Experiments across three model scales on document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to a 4.2$\times$ inference speedup and a 1.9$\times$ training acceleration. Code is available at https://github.com/OPPO-Mente-Lab/PixelPrune.
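
As a back-of-envelope check on the 22--71\% statistic, one plausible way to measure pixel-level redundancy is to count distinct patch values within a single image. A minimal sketch; the 14-pixel patch size and the helper name are illustrative assumptions, not taken from the paper's code:

```python
import numpy as np

def pixel_unique_ratio(image: np.ndarray, patch: int = 14) -> float:
    """Fraction of distinct patch values among all patches of one image.

    Illustrative helper (not from the paper); patch size is an assumption.
    """
    image = np.atleast_3d(image)                 # (H, W, C)
    h = image.shape[0] - image.shape[0] % patch  # drop partial border patches
    w = image.shape[1] - image.shape[1] % patch
    c = image.shape[2]
    tiles = (image[:h, :w]
             .reshape(h // patch, patch, w // patch, patch, c)
             .transpose(0, 2, 1, 3, 4)           # group pixels by patch
             .reshape(-1, patch * patch * c))    # one row per patch
    return len(np.unique(tiles, axis=0)) / len(tiles)
```

On a flat-background screenshot this ratio tends to be far lower than on a natural photo, consistent with the heavier token reduction Figure 1 reports for the GUI screenshot.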

Paper Structure

This paper contains 36 sections, 2 theorems, 4 equations, 3 figures, and 12 tables.

Key Result

Proposition 1

For $\tau = 0$, the original patch sequence $\mathcal{P} = \{P_1, \dots, P_N\}$ is exactly recoverable from the compressed representation $\mathcal{C} = \{(P_k, \mathbf{p}_k)\}_{k \in \mathcal{S}}$ via a deterministic decoder that applies the same prediction rule in the same scan order.
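
The proposition's argument needs only that the predictor is deterministic and causal: at $\tau = 0$ a patch is dropped only when it equals its prediction, so the decoder, replaying the identical rule in the same scan order over already-reconstructed neighbors, regenerates it exactly (by induction over the scan order). A minimal round-trip sketch under those assumptions, with a placeholder `predict` standing in for the paper's Pred-2D rule (all names here are illustrative):

```python
import numpy as np

def predict(A, B, C):
    """Placeholder deterministic predictor from the causal neighbors
    A (left), B (upper), C (upper-left); Pred-2D is one instance."""
    return A if A is not None else B

def compress(grid, tau=0.0):
    """Raster-scan encoder over a 2-D grid of float patch arrays: keep a
    patch only if it deviates from its prediction by more than tau.
    The (i, j) index plays the role of the stored position p_k."""
    kept = {}
    for i in range(len(grid)):
        for j in range(len(grid[0])):
            A = grid[i][j - 1] if j > 0 else None
            B = grid[i - 1][j] if i > 0 else None
            C = grid[i - 1][j - 1] if i > 0 and j > 0 else None
            pred = predict(A, B, C)
            if pred is None or np.abs(grid[i][j] - pred).max() > tau:
                kept[(i, j)] = grid[i][j]
    return kept

def decompress(kept, rows, cols):
    """Deterministic decoder: same rule, same scan order. Exact for tau = 0."""
    out = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if (i, j) in kept:
                out[i][j] = kept[(i, j)]
            else:  # dropped patch: regenerate from reconstructed neighbors
                out[i][j] = predict(out[i][j - 1] if j > 0 else None,
                                    out[i - 1][j] if i > 0 else None,
                                    out[i - 1][j - 1] if i > 0 and j > 0 else None)
    return out
```

For $\tau > 0$, dropped patches come back as predictions rather than originals, the lossy regime that Proposition 2 bounds. Note that this sketch's encoder predicts from original neighbors, which is an assumption; a closed-loop encoder predicting from reconstructed neighbors would control error accumulation differently.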

Figures (3)

  • Figure 1: PixelPrune patch selection on a document image (left) and GUI screenshot (right). Kept patches are shown in original color; dropped patches are grayed out. Token reduction: 70% (document) and 93% (GUI).
  • Figure 2: Prefill latency breakdown across Qwen3-VL scales (2B, 4B, 8B) at five resolutions (256$^2$--4096$^2$, log scale). Each bar splits into Vision Encoder (red, including ViT and Patch Merger) and LLM Prefill (blue). At 4096$^2$, the vision encoder accounts for 86%, 72%, and 75% of total prefill time for the 2B, 4B, and 8B models respectively.
  • Figure 3: Illustration of PixelPrune's Pred-2D prediction. For each target patch $X$, three causal neighbors---$A$ (left), $B$ (upper), $C$ (upper-left)---determine the predicted patch $\hat{X}$. The rule selects the neighbor most likely to match $X$: when the upper and upper-left patches agree, the target is more likely to follow the left neighbor, and vice versa. If $X$ matches its prediction, it is omitted; otherwise, it is retained.
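
The Figure 3 selection rule is simple enough to state as code. A sketch of the selector, where boundary handling and the final fallback are assumptions (the caption specifies only the two agreement cases):

```python
import numpy as np

def pred2d(A, B, C):
    """Pred-2D neighbor selection per Figure 3: A (left), B (upper),
    C (upper-left). Boundary cases and the tie-break are illustrative."""
    if A is None:               # first column: only upper context exists
        return B
    if B is None:               # first row: only left context exists
        return A
    if np.array_equal(B, C):    # upper edge locally constant -> copy left
        return A
    if np.array_equal(A, C):    # left edge locally constant -> copy upper
        return B
    return A                    # assumed fallback when neither edge agrees

# A patch X is omitted when it matches pred2d(A, B, C) exactly (tau = 0),
# or within tolerance tau in the lossy mode; otherwise it is retained.
```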

Theorems & Definitions (2)

  • Proposition 1: Exact Reconstruction
  • Proposition 2: Bounded Reconstruction Error