AdaFV: Rethinking of Visual-Language alignment for VLM acceleration

Jiayi Han, Liang Du, Yiwen Wu, Xiangguo Zhou, Hongwei Du, Weibo Zheng

TL;DR

This work tackles the inefficiency of vision-language models (VLMs) caused by thousands of visual tokens by proposing AdaFV, a training-free token pruning approach. The authors first show that pre-LLM text embeddings align with visual tokens and that text-to-image similarity remains a reliable cue for token preservation, even when self-attention is biased. They introduce a self-adaptive cross-modality attention mixture (SACMAM) that balances text-to-image similarity and visual saliency via a temperature reweighting and a geometric-mean objective to select a compact set of informative visual tokens. Empirically, AdaFV achieves state-of-the-art training-free acceleration on LLaVA variants across 75–95% token reductions and demonstrates robustness across diverse datasets, offering substantial efficiency gains without additional training costs.

Abstract

The success of VLMs often relies on a dynamic high-resolution scheme that adaptively augments the input image into multiple crops so that image details are retained. However, such approaches produce a large number of redundant visual tokens, which significantly reduces the efficiency of VLMs. To improve efficiency without introducing extra training costs, many works have been proposed that reduce the number of visual tokens by filtering out uninformative tokens or aggregating their information. Some approaches reduce the visual tokens according to the self-attention of VLMs, which is biased and can lead to inaccurate responses. Token reduction approaches that rely solely on visual cues are text-agnostic and fail to focus on the areas most relevant to the question, especially when the queried objects are not salient in the image. In this work, we first conduct experiments showing that the original text embeddings are aligned with the visual tokens, without bias toward the tail visual tokens. We then propose a self-adaptive cross-modality attention mixture mechanism that dynamically balances visual saliency and text-to-image similarity in the pre-LLM layers to select the informative visual tokens. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art training-free VLM acceleration performance, especially when the reduction rate is sufficiently large.
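
The mixture mechanism summarized above can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the authors' implementation: the function names (`t2i_scores`, `select_tokens`), the temperature value, the use of cosine similarity, and the exhaustive search over how the token budget is split between the saliency ranking and the text-to-image ranking are all hypothetical choices made for this example. The per-token saliency values are assumed to come from the vision encoder.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax with a temperature (assumed hyperparameter)."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)


def t2i_scores(text_emb, visual_tok, temperature=0.1):
    """Text-to-image attention: softmax over visual tokens for each text
    token, then the maximum over text tokens for each visual token."""
    per_text = [
        softmax([cosine(t, v) for v in visual_tok], temperature)
        for t in text_emb
    ]
    return [max(row[j] for row in per_text) for j in range(len(visual_tok))]


def select_tokens(t2i, saliency, budget):
    """Keep `budget` visual tokens. Try every split k (k tokens from the
    saliency ranking, the rest from the T2I ranking) and keep the split
    that maximizes the geometric mean of the retained saliency mass and
    the retained T2I mass. This search is an assumption of the sketch."""
    n = len(t2i)
    budget = min(budget, n)
    by_sal = sorted(range(n), key=lambda i: -saliency[i])
    by_t2i = sorted(range(n), key=lambda i: -t2i[i])
    best_score, best_set = -1.0, []
    for k in range(budget + 1):
        chosen = list(by_sal[:k])
        seen = set(chosen)
        for i in by_t2i:  # fill the remaining slots by T2I rank
            if len(chosen) == budget:
                break
            if i not in seen:
                chosen.append(i)
                seen.add(i)
        sal_mass = sum(saliency[i] for i in chosen)
        t2i_mass = sum(t2i[i] for i in chosen)
        score = math.sqrt(sal_mass * t2i_mass)
        if score > best_score:
            best_score, best_set = score, sorted(chosen)
    return best_set


# Toy usage: one text token, four 2-D visual tokens, made-up saliency values.
text = [[0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
saliency = [0.7, 0.05, 0.05, 0.2]
print(select_tokens(t2i_scores(text, visual), saliency, budget=2))  # -> [0, 1]
```

In the toy example, token 0 is the most salient and token 1 is the most similar to the text, so the balanced split keeps both. A purely saliency-based selection would drop token 1, which is the text-agnostic failure mode the abstract describes.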

Paper Structure (26 sections, 13 equations, 9 figures, 8 tables)

This paper contains 26 sections, 13 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: The performance of different training-free VLM acceleration methods on LLaVA-NEXT-7B. AdaFV significantly increased the robustness of the model, especially for a large reduction rate.
  • Figure 2: The average AUC on different datasets (a) and the distribution of AUC on each dataset (b–f).
  • Figure 3: Text-to-image similarity distribution of LLaVA-v1.5-7B and LLaVA-NEXT-vicuna-7B.
  • Figure 4: Minimum number of visual tokens to be preserved to select at least one prompt-related visual token, validated on LLaVA-v1.5-7B.
  • Figure 5: The overall pipeline of the proposed approach. We follow standard VLMs to encode the input image and text. We utilize text-to-image similarity to formulate T2I attention and integrate the attention weights of image tokens by calculating the maximum value. We obtain the overall significance by mixing the T2I attention and visual saliency extracted from the vision encoder and selecting the most informative visual tokens accordingly.
  • ...and 4 more figures