ApET: Approximation-Error Guided Token Compression for Efficient VLMs

Qiankun Ma, Ziyao Zhang, Haofei Wang, Jie Chen, Zhen Song, Hairong Zheng

TL;DR

ApET is an Approximation-Error guided Token compression framework that first reconstructs the original visual tokens from a small set of basis tokens via linear approximation, then uses the approximation error to identify and drop the least informative tokens.

Abstract

Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introducing positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budget by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical. Code is available at https://github.com/MaQianKun0/ApET.
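The two steps described in the abstract (linear reconstruction from a few basis tokens, then ranking by approximation error) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `approximation_error_scores` and `select_tokens`, the strided choice of basis tokens, the least-squares fit, and the decision to always keep the basis tokens are all assumptions made for illustration, since the text above does not specify them.

```python
import numpy as np

def approximation_error_scores(tokens, num_basis):
    """Per-token error of a linear reconstruction from a few basis tokens.

    tokens: (n, d) array of visual token features.
    Returns (errors, basis_idx) with errors of shape (n,).
    """
    n = tokens.shape[0]
    # Assumption: basis tokens are picked with a fixed stride; the source
    # does not say how the basis is chosen.
    basis_idx = np.arange(num_basis) * (n // num_basis)
    basis = tokens[basis_idx]                                # (m, d)
    # Least-squares coefficients X solving basis.T @ X ~= tokens.T.
    coeffs, *_ = np.linalg.lstsq(basis.T, tokens.T, rcond=None)  # (m, n)
    recon = (basis.T @ coeffs).T                             # (n, d)
    errors = np.linalg.norm(tokens - recon, axis=1)
    return errors, basis_idx

def select_tokens(tokens, num_basis, num_keep):
    """Indices of tokens to keep: the basis plus the worst-reconstructed tokens."""
    errors, basis_idx = approximation_error_scores(tokens, num_basis)
    # Assumption: basis tokens reconstruct themselves with ~zero error, so
    # they are kept explicitly rather than ranked by error.
    ranked = errors.copy()
    ranked[basis_idx] = np.inf
    keep = np.argsort(-ranked)[:num_keep]
    return np.sort(keep)  # preserve the original token order
```

In this sketch a token that the basis reconstructs well carries little extra information and is dropped first, while a poorly reconstructed token is retained. No attention weights are involved, which is the property the abstract relies on for FlashAttention compatibility.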

Paper Structure (11 sections, 1 theorem, 5 equations, 4 figures, 7 tables)

Key Result

Theorem 1

(Lower bound on the minimal reconstruction MSE $\xi$). Let $H(x \mid z)$ denote the conditional entropy of the input $x$ given the intermediate feature $z$. The minimal reconstruction MSE $\xi$ is bounded below by: $\xi \ge \frac{1}{2\pi e}\exp\left(\frac{2H(x \mid z)}{d}\right)$.
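To get a feel for the bound, the sketch below evaluates it numerically. It assumes that $H(x \mid z)$ is a differential entropy measured in nats, that $d$ is the dimensionality of $x$, and that $\xi$ is a per-dimension MSE; none of these conventions is stated in the excerpt above, and the function names are hypothetical. Under those assumptions, an isotropic Gaussian conditional with variance $\sigma^2$ per dimension makes the bound evaluate to exactly $\sigma^2$.

```python
import math

def mse_lower_bound(cond_entropy_nats, d):
    """Right-hand side of the bound: exp(2 H(x|z) / d) / (2 * pi * e)."""
    return math.exp(2.0 * cond_entropy_nats / d) / (2.0 * math.pi * math.e)

def isotropic_gaussian_entropy(sigma2, d):
    """Differential entropy (nats) of N(mu, sigma2 * I_d): (d/2) ln(2 pi e sigma2)."""
    return 0.5 * d * math.log(2.0 * math.pi * math.e * sigma2)

# Example: d = 16 dimensions, conditional variance 0.25 per dimension.
h = isotropic_gaussian_entropy(0.25, 16)
bound = mse_lower_bound(h, 16)
```

The bound grows exponentially with the conditional entropy: the more uncertainty about $x$ remains after observing $z$, the larger the unavoidable reconstruction error.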

Figures (4)

  • Figure 1: Performance–Efficiency Comparison. (Left) Across various image- and video-understanding benchmarks, ApET significantly outperforms existing token reduction approaches. (Right) ApET can be seamlessly combined with FlashAttention (FA) to further reduce prefilling time on Qwen-2.5-VL, whereas prior token reduction approaches are incompatible with FlashAttention.
  • Figure 2: Comparison of the attention map and the approximation-error map. Attention-guided methods exhibit a positional bias that is agnostic to actual information content, assigning disproportionately high importance to later positions regardless of their semantic relevance (red boxes). In contrast, the error map provides an intuitive and content-aware reflection of token importance.
  • Figure 3: The overview of ApET. ApET selectively filters tokens exhibiting minimal information content based on approximation error, and subsequently employs token merging strategies to accomplish token compression.
  • Figure 4: Visualization of token compression. We present four representative failure cases in which attention-driven token selection misguides the final prediction. For each case, we visualize the input question and its ground-truth answer, the original image, the subset of visual tokens preserved when ranking by attention weights, and the subset preserved by the approximation-error criterion proposed in this work. The retained tokens are highlighted for clear comparison.
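The Figure 3 caption mentions a token merging step after filtering but does not say which strategy is used. As a purely illustrative sketch, the code below folds each dropped token into its most similar kept token (by cosine similarity) and averages them. This is a common pattern in token-merging work and is an assumption here, not a description of ApET's actual merging rule; `merge_dropped_tokens` is a hypothetical name.

```python
import numpy as np

def merge_dropped_tokens(tokens, keep_idx):
    """Average each dropped token into its most similar kept token.

    tokens: (n, d) array; keep_idx: indices of tokens that survive filtering.
    Returns a (len(keep_idx), d) array of merged tokens.
    """
    tokens = np.asarray(tokens, dtype=float)
    keep_idx = np.asarray(keep_idx)
    drop_idx = np.setdiff1d(np.arange(tokens.shape[0]), keep_idx)
    merged = tokens[keep_idx].copy()
    if drop_idx.size == 0:
        return merged
    # Cosine similarity between every dropped token and every kept token.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True).clip(min=1e-12)
    unit = tokens / norms
    sim = unit[drop_idx] @ unit[keep_idx].T          # (num_drop, num_keep)
    target = sim.argmax(axis=1)                      # best kept token per dropped token
    # Accumulate dropped tokens into their targets, then take the mean.
    counts = np.ones(len(keep_idx))
    np.add.at(merged, target, tokens[drop_idx])
    np.add.at(counts, target, 1.0)
    return merged / counts[:, None]
```

The output has exactly one row per kept token, so the token budget is set entirely by the filtering step; merging only changes what each surviving token represents.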

Theorems & Definitions (1)

  • Theorem 1: Lower bound on the minimal reconstruction MSE $\xi$