Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models

Xuyang Liu, Ziming Wang, Junjie Chen, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Linfeng Zhang, Siteng Huang, Honggang Chen

TL;DR

The paper tackles the efficiency challenges of high-resolution LVLMs (HR-LVLMs) that employ dynamic cropping, where token compression methods designed for single-view models fall short. It introduces GlobalCom$^2$, a training-free, plug-and-play framework that uses the global thumbnail as a "commander" to adaptively compress local crops through a global-to-local strategy, coupled with holistic token evaluation. Key contributions include a systematic analysis of dynamic cropping and a per-crop adaptive retention mechanism: each crop's global score $s_j^G = \sum_{i \in \text{crop}_j} s_i^G$ is shifted and scaled as $\tilde{s}_j = (s_j^G - \max_l s_l^G)/\tau$, normalized as $\sigma_j = \frac{\exp(\tilde{s}_j)}{\sum_l \exp(\tilde{s}_l) + \epsilon}$, and converted into a retention ratio $r_j = R \times (1 + \sigma_j - 1/n)$. Tokens within a crop are then evaluated with a holistic score $s_{j,i} = \alpha \hat{s}_{j,i}^{G} + (1-\alpha) s_{j,i}^{L}$ with $\alpha=0.5$. Experiments show that GlobalCom$^2$ retains over 90% of performance while removing 90% of visual tokens and delivers substantial FLOPs and memory savings on image and video tasks, highlighting its practical utility for efficient HR-LVLM deployment.
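The per-crop retention equations above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function name `crop_retention_ratios`, the default `tau=10.0`, and the default `eps=1e-8` are assumptions made for this example, and the input is assumed to be the already accumulated crop scores $s_j^G$.

```python
import math


def crop_retention_ratios(crop_scores, base_ratio, tau=10.0, eps=1e-8):
    """Turn per-crop global scores s_j^G into per-crop retention ratios r_j.

    crop_scores: list of s_j^G values, one per crop (assumed precomputed).
    base_ratio:  the overall retention ratio R.
    tau, eps:    temperature and stabilizer; the defaults are assumptions.
    """
    n = len(crop_scores)
    if n == 0:
        raise ValueError("at least one crop score is required")

    # s~_j = (s_j^G - max_l s_l^G) / tau  (the shift keeps exp() stable)
    max_score = max(crop_scores)
    shifted = [(s - max_score) / tau for s in crop_scores]

    # sigma_j = exp(s~_j) / (sum_l exp(s~_l) + eps)
    exps = [math.exp(v) for v in shifted]
    denom = sum(exps) + eps
    sigmas = [e / denom for e in exps]

    # r_j = R * (1 + sigma_j - 1/n)
    return [base_ratio * (1.0 + sigma - 1.0 / n) for sigma in sigmas]


if __name__ == "__main__":
    # Four hypothetical crops; the second one is the most informative.
    ratios = crop_retention_ratios([1.0, 8.0, 3.0, 2.0], base_ratio=0.1)
    for j, r in enumerate(ratios):
        print(f"crop {j}: retention ratio {r:.4f}")
```

Because the $\sigma_j$ sum to (almost exactly) one, the ratios average to $R$: crops with higher global scores receive a larger share of the token budget, crops with lower scores a smaller one, and the overall budget stays roughly unchanged.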

Abstract

Large vision-language models (LVLMs) excel at visual understanding, but face efficiency challenges due to quadratic complexity in processing long multi-modal contexts. While token compression can reduce computational costs, existing approaches are designed for single-view LVLMs and fail to consider the unique multi-view characteristics of high-resolution LVLMs with dynamic cropping. Existing methods treat all tokens uniformly, but our analysis reveals that global thumbnails can naturally guide the compression of local crops by providing holistic context for informativeness evaluation. In this paper, we first analyze the dynamic cropping strategy, revealing both the complementary nature of thumbnails and crops and the distinctive characteristics of different crops. Based on our observations, we propose "Global Compression Commander" (GlobalCom$^2$), a novel plug-and-play token compression framework for HR-LVLMs. GlobalCom$^2$ leverages the thumbnail as the "commander" to guide the compression of local crops, adaptively preserving informative details while eliminating redundancy. Extensive experiments show that GlobalCom$^2$ maintains over 90% performance while compressing 90% of visual tokens, reducing FLOPs and peak memory to 9.1% and 60%, respectively. Our code is available at https://github.com/xuyang-liu16/GlobalCom2.

Paper Structure (16 sections, 8 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Design philosophy of "global-to-local" guided token compression. GlobalCom$^2$ evaluates the information richness of local crops from a global perspective to preserve informative regions while removing redundant ones.
  • Figure 2: Complementary roles of global thumbnail and local crops in HR-LVLMs with dynamic cropping. Performance (%) denotes relative scores of LLaVA-NeXT-7B.
  • Figure 3: Varying contributions of local crops. Importance is quantified by the accumulated attention scores between thumbnail patches and the [CLS] token within each crop.
  • Figure 4: Content-agnostic positional bias. LLM attention-guided methods (e.g., FastV) assign higher scores (bars) to later tokens, regardless of their content or input order (second row: sequential crops; third row: reversed crops).
  • Figure 5: Overall framework. GlobalCom$^2$ guides token compression for HR-LVLMs through: 1) compressing thumbnail tokens (blue path), and 2) compressing crop tokens (yellow paths) by (a) adaptively adjusting compression intensity based on global visual richness, and (b) performing compression according to token informativeness from global and local perspectives.
  • ...and 3 more figures
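
Step (b) in the Figure 5 caption, compression based on token informativeness from global and local perspectives, can be sketched as follows. This is a hedged illustration rather than the released code: the function name `select_crop_tokens` is hypothetical, the global scores are assumed to be already mapped onto the crop's token grid (the $\hat{s}_{j,i}^{G}$ of the TL;DR), and keeping the top-scoring tokens is an assumption about how the holistic score $s_{j,i} = \alpha \hat{s}_{j,i}^{G} + (1-\alpha) s_{j,i}^{L}$ is used.

```python
def select_crop_tokens(global_scores, local_scores, retention_ratio, alpha=0.5):
    """Return the indices of the tokens to keep within one crop.

    global_scores:   per-token scores from the global view, already aligned
                     with this crop's tokens (assumed precomputed).
    local_scores:    per-token scores from the crop's own view.
    retention_ratio: fraction of this crop's tokens to keep (r_j).
    alpha:           weight of the global view; 0.5 as in the TL;DR.
    """
    if len(global_scores) != len(local_scores):
        raise ValueError("global and local scores must have the same length")
    num_tokens = len(global_scores)
    if num_tokens == 0:
        return []

    # Holistic score: s = alpha * s_hat^G + (1 - alpha) * s^L
    holistic = [
        alpha * g + (1.0 - alpha) * l
        for g, l in zip(global_scores, local_scores)
    ]

    # Keep at least one token and never more than the crop contains.
    keep = min(num_tokens, max(1, int(retention_ratio * num_tokens)))

    # Rank by holistic score (highest first), then restore spatial order.
    ranked = sorted(range(num_tokens), key=lambda i: holistic[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    kept = select_crop_tokens(
        global_scores=[0.1, 0.6, 0.2, 0.1],
        local_scores=[0.5, 0.3, 0.1, 0.1],
        retention_ratio=0.5,
    )
    print("kept token indices:", kept)
```

Returning the kept indices in their original order preserves the spatial sequence of the surviving tokens, which seems a reasonable choice here because the retained tokens are still consumed as an ordered sequence by the language model.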