Window Token Concatenation for Efficient Visual Large Language Models
Yifan Li, Wentao Bao, Botao Ye, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong
TL;DR
WiCo tackles the token-cost problem in Visual Large Language Models by projecting the visual token map $\mathbf{v} \in \mathbb{R}^{n \times D_v}$ to a smaller set $\mathbf{v}_l \in \mathbb{R}^{k \times D_l}$ ($k < n$) using a 2D window concatenation strategy, paired with adaptive fine-tuning of the last $K_v$ layers of the vision encoder. WiCo+ extends this by upsampling visual tokens in the late layers of the LLM decoder, enabling better fine-grained visual understanding through a hierarchy of window-level and patch-level attention. Empirical results on LLaVA-1.5 and Shikra show that WiCo(+) achieves superior or near-parity performance with only a fraction of the original tokens across general VQA and grounding tasks, reducing training and inference costs while preserving task sensitivity. The approach has practical value for deploying high-resolution VLLMs and suggests promising directions for extending token-reduction techniques to video and other modalities.
Abstract
To effectively reduce the visual tokens in Visual Large Language Models (VLLMs), we propose a novel approach called Window Token Concatenation (WiCo). Specifically, we employ a sliding window to concatenate spatially adjacent visual tokens. However, directly concatenating these tokens may group diverse tokens into one and thus obscure some fine details. To address this challenge, we propose fine-tuning the last few layers of the vision encoder to adaptively adjust the visual tokens so that those within the same window exhibit similar features. To further enhance performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in the later layers of the LLM. This design benefits from the large perception field of the LLM for fine-grained visual understanding while keeping the number of visual tokens small for efficient inference. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance than existing token reduction projectors. The code is available at https://github.com/JackYFL/WiCo.
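The window concatenation step described above can be illustrated with a minimal NumPy sketch. It is not taken from the WiCo repository: the function name `window_concat`, the row-major patch ordering, the non-overlapping windows (stride equal to window size), and the toy grid and feature sizes are all assumptions made for illustration. The learned projection to $D_l$, the vision-encoder fine-tuning, and the WiCo+ decomposition are omitted.

```python
import numpy as np

def window_concat(tokens, grid_h, grid_w, win):
    """Concatenate each win x win window of tokens along the feature axis.

    Hypothetical helper, not the authors' implementation.
    tokens: array of shape (grid_h * grid_w, d), patches in row-major order.
    Returns an array of shape ((grid_h // win) * (grid_w // win), win * win * d).
    """
    n, d = tokens.shape
    if n != grid_h * grid_w:
        raise ValueError("token count does not match the grid size")
    if grid_h % win or grid_w % win:
        raise ValueError("grid must be divisible by the window size")
    # Split rows and columns into (block index, offset inside the block).
    x = tokens.reshape(grid_h // win, win, grid_w // win, win, d)
    # Bring the two block indices to the front: (H/win, W/win, win, win, d).
    x = x.transpose(0, 2, 1, 3, 4)
    # Flatten each window into a single, wider token.
    return x.reshape((grid_h // win) * (grid_w // win), win * win * d)

# Toy example (assumed sizes): a 4 x 4 grid of 2-dimensional tokens.
tokens = np.arange(16 * 2).reshape(16, 2)
merged = window_concat(tokens, grid_h=4, grid_w=4, win=2)
print(merged.shape)  # (4, 8)
```

With a window of size 2, the token count drops by a factor of 4 while each merged token becomes 4 times wider, so a subsequent linear layer would be needed to map it to the LLM embedding size. Because unrelated patches can land in the same window, the abstract's fine-tuning of the last vision-encoder layers is what encourages tokens inside a window to be similar before they are merged.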
