The Model Knows Which Tokens Matter: Automatic Token Selection via Noise Gating

Landi He, Xiaoyu Yang, Lijian Xu

TL;DR

AutoSelect attaches a lightweight Scorer and Denoiser to a frozen VLM and trains with only the standard next-token prediction loss, without auxiliary objectives or extra annotations. It transfers to different VLM backbones without architecture-specific tuning.

Abstract

Visual tokens dominate inference cost in vision-language models (VLMs), yet many carry redundant information. Existing pruning methods alleviate this but typically rely on attention magnitude or similarity scores. We reformulate visual token pruning as capacity-constrained communication: given a fixed budget K, the model must allocate limited bandwidth to maximally preserve visual information. We propose AutoSelect, which attaches a lightweight Scorer and Denoiser to a frozen VLM and trains with only the standard next-token prediction loss, without auxiliary objectives or extra annotations. During training, a variance-preserving noise gate modulates each token's information flow according to its predicted importance, so that gradients propagate through all tokens; a diagonal-attention Denoiser then recovers the perturbed representations. At inference, only the Scorer and a hard top-K selection remain, adding negligible latency. On ten VLM benchmarks, AutoSelect retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85× with only 0.69 ms overhead, and transfers to different VLM backbones without architecture-specific tuning. Code is available at https://github.com/MedHK23/AutoSelect.
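To make the training-time gate concrete, here is a minimal sketch of one common variance-preserving form, in which a token is scaled by $\sqrt{g}$ and Gaussian noise by $\sqrt{1-g}$ so that the squared weights sum to one. This specific formula, the function name `vp_noise_gate`, and the assumption of roughly unit-variance token features are illustrative choices, not details confirmed by the abstract; the released code may differ.

```python
import math
import random


def vp_noise_gate(tokens, gates, rng):
    """Blend each token with Gaussian noise according to its gate value.

    tokens: list of token feature vectors (lists of floats).
    gates:  one value per token in [0, 1]; 1 passes the token through
            unchanged, 0 replaces it entirely with noise.
    rng:    a random.Random instance used to draw the noise.

    Assumed form (not taken from the paper): sqrt(g) * x + sqrt(1 - g) * eps.
    """
    gated = []
    for token, g in zip(tokens, gates):
        g = min(max(g, 0.0), 1.0)  # clamp to the valid range
        keep, inject = math.sqrt(g), math.sqrt(1.0 - g)
        gated.append([keep * x + inject * rng.gauss(0.0, 1.0) for x in token])
    return gated


rng = random.Random(0)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
gates = [1.0, 0.7, 0.3, 0.0]  # high importance -> little noise
noisy = vp_noise_gate(tokens, gates, rng)
```

Because every token still reaches the output, just with more or less noise, a gate of this kind is differentiable with respect to the gate values when written in an autodiff framework, which is the property the abstract relies on for propagating gradients through all tokens.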

Paper Structure (27 sections, 5 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Two views of visual token pruning. (Top) Hard pruning directly discards a subset of tokens. (Bottom) Our capacity-constrained formulation retains all tokens but limits the total information throughput to the same budget through a bandwidth-limited channel.
  • Figure 2: Overview of the AutoSelect framework. Visual tokens from the frozen Image Encoder pass through a learnable Scorer that assigns per-token importance scores. These scores are polarized by the differentiable Soft Top-$K$ operator under a fixed bandwidth budget $K$. During training (lower path), a VP Noise Gate injects variance-preserving noise into each token in inverse proportion to its score; the Denoiser then maps the perturbed sequence back toward the LLM's expected input space. At inference (upper path), the Denoiser and noise injection are discarded: Hard Top-$K$ retains the $K$ highest-scoring tokens with their original position indices. All base VLM parameters, including the Image Encoder, modality projector, and LLM, remain frozen.
  • Figure 3: LLM-free classification on ImageNet-1K. Each method generates a selection mask on a $24{\times}24$ token grid, which is resized to $14{\times}14$ and applied to a ViT-B/16 by removing unselected patches before embedding.
  • Figure 4: Token selection and pairwise similarity ($K{=}64$). Red/blue patches denote the 64 highest-/lowest-scored tokens. The $128{\times}128$ cosine-similarity matrix is arranged as $[\text{top-}64\;|\;\text{bottom-}64]$. Retained tokens (upper-left block) are dissimilar; pruned tokens (lower-right block) are highly similar.
  • Figure 5: Validation of VP-noise gating as a differentiable proxy for hard Top-$K$ pruning. Both strategies are applied after patch embedding and before the vision encoder, matching the insertion point used by AutoSelect. Both use identical random score maps to control for scoring quality. Quantitatively (a) and qualitatively (b), the two mechanisms produce equivalent information restriction across all budget levels, justifying VP-noise gating as a continuous, gradient-friendly training surrogate.
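
The inference path described in the Figure 2 caption (keep the $K$ highest-scoring tokens together with their original position indices) can be sketched as follows. The function name `hard_top_k` and the toy scores are hypothetical, and breaking ties in favor of the earlier token is an assumption of this sketch rather than a documented behavior of AutoSelect.

```python
def hard_top_k(scores, k):
    """Return the indices of the k highest scores in original sequence order.

    Returning sorted indices (rather than rank order) means the kept tokens
    can reuse their original position indices downstream.
    """
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and the number of tokens")
    # Rank token indices by score, highest first; Python's sort is stable,
    # so tied scores keep their original relative order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


scores = [0.1, 0.9, 0.3, 0.8, 0.2]
tokens = ["t0", "t1", "t2", "t3", "t4"]

keep = hard_top_k(scores, k=2)
kept_tokens = [tokens[i] for i in keep]
print(keep, kept_tokens)  # [1, 3] ['t1', 't3']
```

Only the scoring pass and this selection run at inference, which is consistent with the small overhead the abstract reports; the noise gate and Denoiser are not involved.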