When Less is Enough: Adaptive Token Reduction for Efficient Image Representation

Eduard Allakhverdov, Elizaveta Goncharova, Andrey Kuznetsov

TL;DR

This work tackles the inefficiency of vision encoders by reducing the number of visual tokens without sacrificing performance. It introduces an interpretable autoencoder-based pipeline with a Gumbel-Softmax selector that isolates the most informative visual tokens, enabling adaptive pruning in multimodal models. Empirical results on LLaVA-NeXT and LLaVA-OneVision show that about 50% of the visual context can be removed with negligible loss on OCR-like tasks, and that even larger reductions (up to ~90%) are possible in some settings, all without downstream fine-tuning. The approach promises faster, more memory-efficient inference and lays groundwork for scalable multimodal reasoning. Its stated limitations are incompatibility with some interpolation-based compression methods and the untested potential gains from joint fine-tuning, which is left to future work.

Abstract

Vision encoders typically generate a large number of visual tokens, providing information-rich representations but significantly increasing computational demands. This raises the question of whether all generated tokens are equally valuable or whether some of them can be discarded to reduce computational costs without compromising quality. In this paper, we introduce a new method for determining feature utility based on the idea that less valuable features can be reconstructed from more valuable ones. We implement this concept by integrating an autoencoder with a Gumbel-Softmax selection mechanism that identifies and retains only the most informative visual tokens. To validate our approach, we compared the performance of the LLaVA-NeXT model when using features selected by our method against its performance when using randomly selected features. We found that on OCR-based tasks, more than 50% of the visual context can be removed with minimal performance loss, whereas randomly discarding the same proportion of features significantly degrades the model's capabilities. Furthermore, in general-domain tasks, even randomly retaining only 30% of tokens achieves performance comparable to using the full set of visual tokens. Our results highlight a promising direction towards adaptive and efficient multimodal pruning that facilitates scalable and low-overhead inference without compromising performance.
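
The selection step named in the abstract can be illustrated with a minimal NumPy sketch of Gumbel-Softmax sampling over a per-token keep/drop decision. This is not the authors' implementation: the function name `gumbel_softmax_mask`, the two-column `[drop, keep]` logit layout, and the temperature `tau` are assumptions made for illustration, and NumPy has no autodiff, so the straight-through gradient used when training such a selector is not shown.

```python
import numpy as np

def gumbel_softmax_mask(logits, tau=1.0, rng=None):
    """Sample a keep/drop decision per token (hypothetical helper).

    logits: array of shape (num_tokens, 2); column 0 scores "drop",
            column 1 scores "keep" (an assumed layout).
    Returns (soft, hard): soft probabilities of shape (num_tokens, 2)
    and a binary mask of shape (num_tokens,) with 1 = keep, 0 = drop.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via inverse transform; clip to keep the logs finite.
    u = np.clip(rng.uniform(size=logits.shape), 1e-9, 1.0 - 1e-9)
    gumbel = -np.log(-np.log(u))
    # Temperature-scaled, numerically stable softmax over the two classes.
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)
    exp_y = np.exp(y)
    soft = exp_y / exp_y.sum(axis=-1, keepdims=True)
    # Discretize: a token is kept when the "keep" class wins.
    hard = (soft.argmax(axis=-1) == 1).astype(np.float32)
    return soft, hard

# Two tokens with strongly opposed scores: the first is kept, the second dropped.
demo_logits = np.array([[-50.0, 50.0], [50.0, -50.0]])
demo_soft, demo_hard = gumbel_softmax_mask(demo_logits, tau=1.0,
                                           rng=np.random.default_rng(0))
```

A lower `tau` pushes the soft probabilities towards one-hot values, while a higher `tau` makes the sampled decisions noisier.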

Paper Structure

This paper contains 24 sections, 7 equations, 8 figures.

Figures (8)

  • Figure 1: Comparison of feature selection methods on Newton's Principia text: original image (left), random feature selection retaining 40% of tokens (middle), and our proposed feature selector retaining 40% of tokens (right).
  • Figure 2: Illustration of the Feature Selector in training mode. It uses three Transformer layers and a Gumbel-Softmax head to generate a binary mask where zeros mark tokens for removal and ones for retention. During training, the masked embeddings are replaced by a shared learnable embedding. During inference, the masked embeddings are discarded, while the retained ones are used for downstream tasks, such as image representations in Vision-Language models.
  • Figure 3: Illustration of Feature Reconstructor's functionality. Its primary objective is to restore the tokens that were replaced with a learned representation.
  • Figure 4: Comparison of LLaVA-NeXT performance with our selector (orange) and random selector (blue) on text-based benchmarks. The green dashed line represents the baseline performance using all features. The red dashed line represents the model's performance without image input.
  • Figure 5: Comparison of LLaVA-NeXT performance with our selector (orange) and random selector (blue) on non-text benchmarks. The green dashed line represents the baseline performance using all features. The red dashed line represents the model's performance without image input.
  • ...and 3 more figures
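
The captions of Figures 2 and 3 describe how the binary mask is used: during training, removed tokens are replaced by a shared learnable embedding and a reconstructor tries to restore them; during inference, removed tokens are simply discarded. The NumPy sketch below illustrates that bookkeeping only. The function names are hypothetical, and the mean squared error over removed tokens is an assumed stand-in for the reconstruction objective, which this page does not specify.

```python
import numpy as np

def apply_mask_training(tokens, mask, mask_embedding):
    """Replace removed tokens (mask == 0) with one shared embedding.

    tokens: (num_tokens, dim); mask: (num_tokens,) with 1 = keep, 0 = remove;
    mask_embedding: (dim,), the shared vector (learnable in a real model).
    """
    keep = mask[:, None].astype(tokens.dtype)
    return keep * tokens + (1.0 - keep) * mask_embedding[None, :]

def apply_mask_inference(tokens, mask):
    """Drop removed tokens entirely, shortening the sequence."""
    return tokens[mask.astype(bool)]

def masked_reconstruction_error(original, reconstructed, mask):
    """Mean squared error over removed tokens only (assumed objective)."""
    removed = ~mask.astype(bool)
    if not removed.any():
        return 0.0
    diff = original[removed] - reconstructed[removed]
    return float(np.mean(diff ** 2))

# Four 3-dimensional tokens; tokens 1 and 3 are marked for removal.
tokens = np.arange(12, dtype=np.float32).reshape(4, 3)
mask = np.array([1, 0, 1, 0])
shared = np.zeros(3, dtype=np.float32)

train_input = apply_mask_training(tokens, mask, shared)  # same length, 2 rows replaced
infer_input = apply_mask_inference(tokens, mask)         # only the 2 kept rows remain
```

Under this reading, a token is cheap to remove when the reconstructor can restore it from the retained tokens, which is the notion of feature utility the abstract describes.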