DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

Aditya Kumar Singh; Hitesh Kandala; Pratik Prabhanjan Brahma; Zicheng Liu; Emad Barsoum

TL;DR

Vision-language models incur high computational cost from dense visual tokens. DUET-VLM introduces dual-stage compression that combines vision-side, redundancy-aware token merging with language-side, text-guided token dropping, allowing aggressive token reduction without significant accuracy loss. It demonstrates strong results on image and video benchmarks, including near-baseline accuracy at substantial token reductions and faster training times, outperforming prior token-efficiency methods. The approach highlights the value of training-aware, joint token management for scalable, high-performance VLMs, and the code is released for community use.

Abstract

Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet they remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only, redundancy-aware compression of the vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at 89% reduction. When this dual-stage compression is applied during training, it achieves 99.7% accuracy at 67% reduction and 97.6% at 89% reduction, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline, achieving >100% accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% reduction setting. These results highlight that end-to-end training with DUET-VLM enables robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD-AGI/DUET-VLM.
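To make the reduction percentages concrete, the short sketch below converts a kept-token count into a reduction percentage relative to the 576-token LLaVA-1.5-7B baseline named in the Figure 1 caption. The budgets of 192 and 64 tokens are assumptions chosen for illustration because they round to the 67% and 89% quoted above; the abstract itself does not state the exact token counts.

```python
BASELINE_TOKENS = 576  # full LLaVA-1.5-7B visual token count (Figure 1 caption)


def reduction_percent(kept: int, baseline: int = BASELINE_TOKENS) -> float:
    """Percentage of visual tokens removed when `kept` of `baseline` remain."""
    return 100.0 * (1.0 - kept / baseline)


# ASSUMPTION: 192 and 64 are illustrative budgets, not values stated in the abstract.
for kept in (192, 64):
    print(f"{kept:>3} tokens kept -> {reduction_percent(kept):.1f}% reduction")
```

Under these assumed budgets, the reductions come out to roughly 66.7% and 88.9%, which match the rounded figures in the abstract.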

Paper Structure (35 sections, 8 figures, 15 tables, 1 algorithm)

Introduction
Our Perspective.
Our Approach.
Related Works
Token Efficiency in VLMs.
Joint Multimodal Compression.
Methodology
Clustering of vision tokens
Motivation for local clustering
a. Semantic misalignment
b. Information dilution
Local Cluster Aggregation.
Hierarchical visual tokens dropping
Text-Guided Visual Token Dropping.
Experiment
...and 20 more sections

Figures (8)

Figure 1: Efficiency and accuracy comparison of DUET-VLM. (a) shows the average inference-only accuracy of VisionZip, PyramidDrop, and DUET-VLM across different token budgets, compared to the full 576-token LLaVA-1.5-7B baseline on four benchmarks. (b) demonstrates that the trained DUET-VLM model achieves a 31% reduction in training time while incurring less than a 1% drop in accuracy relative to the LLaVA-1.5-7B baseline.
Figure 2: Overview of the proposed pipeline. An input image is first encoded into $N$ visual tokens by the Vision Encoder. (A) Based on the V2V self-attention map, with per-token score $A_{v2v}^{\cdot,i} \coloneqq \frac{1}{N}\sum_j A_{v2v}^{j,i}$, we select the most influential Top-$k_1$ dominant tokens, while the remaining tokens, $\mathbf{X}_{\text{res}}$, are merged into $k_2$ contextual tokens via attention-guided clustering with a fixed cluster width $w$ to reduce redundancy. (B) The resulting reduced visual tokens, $\mathbf{X}_{\text{out}}$, are projected through an MLP Adapter and fed into a language backbone, where salient text tokens, $\mathcal{S}$, further prune visual tokens based on cross-attention scores $A_{t2v}$ at selected layers called stages.
Figure 3: Ablation on varying the cluster width of DUET-VLM (C) on the VQA$^{T}$ benchmark with LLaVA-1.5-7B for different token budgets.
Figure 4: Attention heatmap of salient text tokens attending to visual tokens at the 9th layer of the language backbone in DUET-VLM (C+S).
Figure 5: Performance of various dropping configurations in the language backbone on TextVQA using all text tokens. Red crosses indicate configurations in which all visual tokens are removed at some layer prior to the final layer.
...and 3 more figures
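The two stages described in the Figure 2 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names `select_and_merge` and `text_guided_drop` are hypothetical, the clustering rule (contiguous runs of $w$ residual tokens merged by a score-weighted average) is a stand-in for the paper's attention-guided clustering, the salient text indices are assumed to be given, and the MLP Adapter between the stages is omitted.

```python
import numpy as np


def select_and_merge(x, a_v2v, k1, w):
    """Stage (A) sketch: keep the top-k1 dominant tokens, merge the rest."""
    n = x.shape[0]
    # Score of token i = mean over j of A_v2v[j, i] (attention it receives).
    score = a_v2v.mean(axis=0)
    # Indices of the k1 highest-scoring tokens, returned in original order.
    dominant = np.sort(np.argsort(-score, kind="stable")[:k1])
    mask = np.ones(n, dtype=bool)
    mask[dominant] = False
    residual = np.flatnonzero(mask)  # indices of X_res
    merged = []
    # ASSUMPTION: a cluster is a contiguous run of w residual tokens, and its
    # contextual token is the score-weighted average of its members.
    for start in range(0, residual.size, w):
        idx = residual[start:start + w]
        weights = score[idx] / score[idx].sum()
        merged.append(weights @ x[idx])
    contextual = np.stack(merged) if merged else np.empty((0, x.shape[1]))
    return np.concatenate([x[dominant], contextual], axis=0), dominant


def text_guided_drop(x_vis, a_t2v, salient, keep_ratio):
    """Stage (B) sketch: keep visual tokens most attended by salient text."""
    # Average the text-to-vision attention rows of the salient text tokens.
    score = a_t2v[salient].mean(axis=0)
    n_keep = max(1, int(np.ceil(keep_ratio * x_vis.shape[0])))
    keep = np.sort(np.argsort(-score, kind="stable")[:n_keep])
    return x_vis[keep], keep


def softmax_rows(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


# Toy usage with random data: 16 visual tokens of width 4, 5 text tokens.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
a_v2v = softmax_rows(rng.normal(size=(16, 16)))
x_out, dominant = select_and_merge(x, a_v2v, k1=4, w=3)  # 4 + 12/3 = 8 tokens

a_t2v = softmax_rows(rng.normal(size=(5, x_out.shape[0])))
salient = np.array([1, 3])  # hypothetical salient text token indices
x_kept, keep = text_guided_drop(x_out, a_t2v, salient, keep_ratio=0.5)
print(x_out.shape, x_kept.shape)
```

In this toy setting, stage (A) reduces 16 tokens to 8 (4 dominant plus 4 contextual), and one stage (B) step with a keep ratio of 0.5 leaves 4. In the pipeline the caption describes, the dropping step would be repeated at each selected stage layer using that layer's own attention scores.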
