Rethinking Token Reduction for Large Vision-Language Models

Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang

Abstract

Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.
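
To illustrate the claim that pruning and merging can be expressed as one compression mapping, here is a minimal NumPy sketch. It assumes the mapping is a matrix $P$ of shape (M, N) that left-multiplies the image token sequence $X_\text{IMG}$ of shape (N, d). The abstract does not give the exact form used by MetaCompress, so the linear form, the helper names `pruning_projection` and `merging_projection`, and the toy sizes are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pruning_projection(keep_idx, n_tokens):
    """Hypothetical helper: one-hot rows, each keeping a single original token."""
    P = np.zeros((len(keep_idx), n_tokens))
    P[np.arange(len(keep_idx)), keep_idx] = 1.0
    return P

def merging_projection(groups, n_tokens):
    """Hypothetical helper: each row averages one group of original tokens."""
    P = np.zeros((len(groups), n_tokens))
    for row, members in enumerate(groups):
        P[row, members] = 1.0 / len(members)
    return P

rng = np.random.default_rng(0)
X_img = rng.normal(size=(6, 4))  # toy input: 6 visual tokens, hidden size 4

P_prune = pruning_projection([0, 3, 5], n_tokens=6)
P_merge = merging_projection([[0, 1], [2, 3], [4, 5]], n_tokens=6)

# Both reductions are the same operation: a (3, 6) matrix applied to (6, 4) tokens.
X_pruned = P_prune @ X_img  # keeps tokens 0, 3 and 5 unchanged
X_merged = P_merge @ X_img  # replaces each pair of tokens with its mean
```

Under this assumption, pruning and merging differ only in the entries of $P$, which is what makes it plausible to treat $P$ itself as the object to be learned rather than fixing it by a heuristic.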

Paper Structure (31 sections, 13 equations, 7 figures, 8 tables, 2 algorithms)

Figures (7)

  • Figure 1: (a) Overall pipeline of the compression projection training process. (b) Attention distribution over the [CLS] token for retained and all visual tokens. The image tokens are extracted from the last layer of the vision tower of LLaVA-1.5-13b on the VQA-v2 dataset. (c) Attention distribution over the prompt tokens for retained and all visual tokens. The attention scores are averaged over the prompt tokens and extracted from the first layer of the LLM decoder.
  • Figure 2: Illustration of our proposed MetaCompress, where the module $\mathcal{P}_\text{meta}$ generates the compression projection $P$ solely from the image sequence $X_\text{IMG}$.
  • Figure 3: Comparison of average accuracy on MT-GQA at reduction rates from 50% to 95%.
  • Figure 4: Sensitivity analysis in training MetaCompress for LLaVA-NeXT-7b with different weights $\alpha_\text{entropy}$ and $\alpha_\text{collapse}$ on MT-GQA.
  • Figure 5: (a) Token importance distribution. (b) Attention distribution over prompt tokens. Image tokens are extracted from the last layer of the vision tower of LLaVA-NeXT-7b on the VQA-v2 dataset.
  • ...and 2 more figures