Window Token Concatenation for Efficient Visual Large Language Models

Yifan Li, Wentao Bao, Botao Ye, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong

TL;DR

WiCo tackles the token-cost problem in Visual Large Language Models (VLLMs) by projecting the visual token map $\mathbf{v} \in \mathbb{R}^{n \times D_v}$ to a smaller set $\mathbf{v}_l \in \mathbb{R}^{k \times D_l}$. It does so with a 2D window concatenation strategy, combined with fine-tuning of the last $K_v$ layers of the vision encoder so that tokens inside the same window become similar. WiCo+ extends this by upsampling the visual tokens in the late layers of the LLM decoder, which improves fine-grained visual understanding through a hierarchy of window-level and patch-level attention. Experiments on LLaVA-1.5 and Shikra show that WiCo and WiCo+ match or exceed baseline performance with only a fraction of the original tokens across general VQA and grounding tasks, reducing training and inference costs while accounting for the greater sensitivity of fine-grained tasks to token reduction. The approach is relevant for deploying high-resolution VLLMs and suggests extending token-reduction techniques to video and other modalities.

Abstract

To effectively reduce the visual tokens in Visual Large Language Models (VLLMs), we propose a novel approach called Window Token Concatenation (WiCo). Specifically, we employ a sliding window to concatenate spatially adjacent visual tokens. However, directly concatenating these tokens may group dissimilar tokens into one and thus obscure fine details. To address this challenge, we propose fine-tuning the last few layers of the vision encoder to adaptively adjust the visual tokens, encouraging tokens within the same window to exhibit similar features. To further enhance performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in the later layers of the LLM. This design benefits from the large perception field of the LLM for fine-grained visual understanding while keeping the number of visual tokens small for efficient inference. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance than existing token reduction projectors. The code is available at https://github.com/JackYFL/WiCo.
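
The window concatenation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `window_concat`, the non-overlapping stride (stride equal to the window size), the single random linear layer standing in for the MLP projector, and the toy dimensions are all assumptions made for illustration. The 24×24 grid corresponds to the 576-token map commonly used with LLaVA-1.5.

```python
import numpy as np


def window_concat(tokens: np.ndarray, grid: int, window: int) -> np.ndarray:
    """Concatenate the tokens inside each window x window patch of a 2D token map.

    tokens: array of shape (n, D_v), with n == grid * grid, in row-major order.
    Returns an array of shape (k, window * window * D_v), k == (grid // window) ** 2.
    Assumption: windows do not overlap (stride equals the window size).
    """
    n, d = tokens.shape
    assert n == grid * grid, "tokens must form a square grid"
    assert grid % window == 0, "this sketch assumes the window tiles the grid exactly"
    g = grid // window
    # Axes: (row block, row inside window, column block, column inside window, channel)
    x = tokens.reshape(g, window, g, window, d)
    # Bring the two window axes next to each other so each window is contiguous.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(g * g, window * window * d)


rng = np.random.default_rng(0)
D_v, D_l = 8, 16                            # toy sizes, far smaller than real models
v = rng.standard_normal((24 * 24, D_v))     # n = 576 visual tokens
merged = window_concat(v, grid=24, window=2)

# Hypothetical stand-in for the MLP projector: one random linear map to D_l.
W = rng.standard_normal((merged.shape[1], D_l))
v_l = merged @ W

print(merged.shape, v_l.shape)  # (144, 32) (144, 16)
```

With a 2×2 window the token count drops from $n = 576$ to $k = 144$, while each concatenated token carries four times as many channels before projection. That is why dissimilar tokens in one window can blur details, the issue the fine-tuned encoder layers are meant to reduce.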

Paper Structure

This paper contains 15 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Motivations of our method. (a) Current projector types (left) and ours (right) for VLLM token reduction. Existing token reduction projectors are mainly based on (i) selection, (ii) merging, (iii) concatenation, and (iv) cross-attention. (b) The performance of VLLMs is sensitive to the type of downstream task when the number of visual tokens changes. Specifically, reducing the visual tokens degrades performance more on fine-grained understanding tasks than on coarse-grained ones.
  • Figure 2: Framework of our WiCo$+$. WiCo$+$ consists of two main components: a dynamic window token concatenation projector (WiCo) and a token decomposition strategy in the later layers of the LLM decoder. WiCo first learns similar local token representations with $K_v$ self-attention layers taken from the last $K_v$ layers of a pretrained vision encoder (e.g., CLIP). Then, a sliding window is applied to the 2-D token map to perform concatenation, and an MLP projects these visual tokens into the language space. To further enlarge the perception field of the remaining visual tokens, we decompose the visual tokens in the later layers (e.g., the last $K_l$ layers) of the LLM decoder, which benefits fine-grained understanding tasks.
  • Figure 3: Comparison of visual feature maps (mean pooling) on LLaVA-1.5, obtained from the pretrained CLIP vision encoder with the last few layers tuned (right) and with all layers frozen (middle). The tuned CLIP learns smoother features than the frozen one, indicating that tokens within a sliding window are similar.
  • Figure 4: Visual token decomposition strategies by upsampling from (a) number and (b) channel dimension.
  • Figure 5: Influence of the output visual token number $k$ on MME and VQA$^\text{T}$.
  • ...and 1 more figures