DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs

Zhenhailong Wang, Senthil Purushwalkam, Caiming Xiong, Silvio Savarese, Heng Ji, Ran Xu

TL;DR

DyMU tackles the inefficiency of vision-language models caused by fixed, high token budgets in visual encoders. It introduces Dynamic Token Merging to adapt token counts to image complexity and Virtual Token Unmerging to emulate full RoPE-based attention in LLMs without fine-tuning. Through batch-level thresholding and careful attention reweighting, DyMU reduces the average visual token count by 32-85% while preserving performance across diverse VLMs, including those with AnyRes-based encoders, and it remains training-free. The framework offers practical, controllable compute savings and broad compatibility, as shown by extensive experiments and qualitative analyses.

Abstract

We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence, thus preserving the downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically adapts token compression to the content of the image and operates completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models across diverse VLM architectures, including the recently popularized AnyRes-based visual encoders. Furthermore, through qualitative analyses, we demonstrate that DToMe effectively adapts token reduction based on image complexity and, unlike existing systems, provides users more control over computational costs. Project page: https://mikewangwzhl.github.io/dymu/.
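
As a rough illustration of the first component, the sketch below merges similar tokens for a single image using a similarity threshold, so that the output length depends on how redundant the image is. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `merge_tokens`, the alternating bipartite split, the use of raw token embeddings for cosine similarity, and the plain averaging are all illustrative choices.

```python
import numpy as np

def merge_tokens(tokens, threshold):
    """Hypothetical single-layer merge: returns (merged tokens, tokens per output slot)."""
    # Assumption: alternate tokens between two sets A and B (a ToMe-style bipartite split).
    a, b = tokens[0::2], tokens[1::2]
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_n @ b_n.T                   # cosine similarity, shape (len(a), len(b))
    best = sim.argmax(axis=1)           # most similar B token for each A token
    merge = sim.max(axis=1) > threshold # only edges above the threshold are merged

    sums = b.copy()
    counts = np.ones(len(b))
    for i in np.flatnonzero(merge):
        sums[best[i]] += a[i]           # accumulate merged A tokens into their B partner
        counts[best[i]] += 1
    merged_b = sums / counts[:, None]   # average over all tokens merged into each slot

    out = np.concatenate([a[~merge], merged_b], axis=0)
    sizes = np.concatenate([np.ones(int((~merge).sum())), counts])
    return out, sizes

# A redundant "image" (identical tokens) shrinks; a diverse one (orthogonal tokens) does not.
simple_out, simple_sizes = merge_tokens(np.ones((8, 4)), threshold=0.9)
complex_out, complex_sizes = merge_tokens(np.eye(8), threshold=0.9)
print(simple_out.shape[0], complex_out.shape[0])
```

With a fixed threshold, the number of surviving tokens varies per image, which is the behavior the abstract describes as adapting compression to image content.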

Paper Structure

This paper contains 32 sections, 10 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Method Overview. DyMU is composed of two key ideas: Dynamic Token Merging (DToMe) and Virtual Token Unmerging (VTU). DToMe first determines per-layer thresholds (left) by feeding a large batch of images into the vision transformer and computing bipartite token similarities. We rank these edges across the entire batch and choose the top-$Br$ ($r=$ desired average number of tokens, batch size $B$). As a result, more edges are chosen from simpler images (which have more redundancy), while complex images remain less merged. During inference, DToMe merges tokens on a per-image basis using these pre-computed thresholds. We then apply VTU (right) in the self-attention layers of the pretrained VLM to efficiently expand the attention matrices to the standard token count, ensuring the model's original weights and outputs remain compatible, before re-merging the tokens for the next layer. The overall process is training-free and utilizes crucial image information by allocating the token budget more effectively for both simple and complex images.
  • Figure 2: Image Complexity vs. Token Count and Accuracy. The scatter plot (left) demonstrates a strong correlation between DyMU's token count and image complexity score: more complex images naturally receive more tokens. On the right, MME accuracy at varying complexity levels is compared between ToMe (fixed-length) and DyMU (dynamic-length), highlighting the benefit of assigning additional tokens to complex images.
  • Figure 3: Importance of Virtual Token Unmerging (VTU). We ablate the performance of LLaVA 1.5 with two token reduction methods applied to the visual encoder: ToMe (fixed-length) and DToMe (variable-length). We observe that applying VTU significantly improves performance on 8 out of 9 benchmarks, demonstrating robustness to varied token reduction methods.
  • Figure 4: Comparing thresholds using LLaVA Instruct Data vs. Pixmo-Cap. Although both methods use the same per-layer merging hyperparameter ($r_i$), the Pixmo-based thresholds lead to fewer tokens (top), likely due to domain differences. However, performance across a range of benchmarks shows minimal drop (bottom), indicating the robustness of our threshold estimation.
  • Figure 5: Controllable Visual Token Length. By dynamically allocating tokens based on image complexity, DyMU enables direct control over computational cost. In these examples, we combine DyMU with additional vision tools—background removal, OCR, or object detection—to focus only on the relevant regions. As a result, token count is substantially reduced without degrading performance, showcasing the flexibility of DyMU to adapt token usage according to the task’s requirements.
  • ...and 1 more figures
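
The batch-level threshold selection described in the Figure 1 caption can be sketched roughly as follows. This is a minimal sketch under stated assumptions rather than the paper's code: the helper names `bipartite_edge_scores` and `batch_threshold` are hypothetical, similarities are computed on raw token embeddings, and the threshold is taken to be the score of the $Br$-th ranked edge in the batch.

```python
import numpy as np

def bipartite_edge_scores(tokens):
    """Best cosine similarity from each token in set A to any token in set B."""
    # Assumption: alternating split into two sets, as in ToMe-style bipartite matching.
    a, b = tokens[0::2], tokens[1::2]
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a_n @ b_n.T).max(axis=1)

def batch_threshold(batch, r):
    """Rank edges over the whole batch and return the score of the (B*r)-th edge."""
    scores = np.concatenate([bipartite_edge_scores(t) for t in batch])
    k = len(batch) * r                  # top-Br edges across the batch
    if not 1 <= k <= len(scores):
        raise ValueError("B*r must be between 1 and the number of edges")
    ranked = np.sort(scores)[::-1]      # descending similarity
    return ranked[k - 1]

# Toy batch: one redundant image (identical tokens) and one diverse image (orthogonal tokens).
simple_img, complex_img = np.ones((8, 4)), np.eye(8)
tau = batch_threshold([simple_img, complex_img], r=2)
merged_simple = int((bipartite_edge_scores(simple_img) >= tau).sum())
merged_complex = int((bipartite_edge_scores(complex_img) >= tau).sum())
print(merged_simple, merged_complex)
```

In this toy batch the budget averages two merges per image, but all selected edges come from the redundant image, which mirrors the caption's point that simpler images contribute more merged edges while complex images stay less merged.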