MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

Jianjian Cao; Peng Ye; Shengze Li; Chong Yu; Yansong Tang; Jiwen Lu; Tao Chen

TL;DR

MADTP is a joint framework for accelerating Vision-Language Transformers (VLTs) that combines a Multi-modality Alignment Guidance (MAG) module with a Dynamic Token Pruning (DTP) module. MAG aligns visual and textual representations using learnable tokens and produces cross-modal attention maps that guide token pruning, while DTP performs instance- and layer-wise adaptive pruning based on a composite Token Importance Score and a learnable threshold. The framework is trained with a combined objective $L = L_{task} + \alpha L_{sim}$, where $L_{sim}$ enforces cross-modal alignment and pruning effectiveness. Experiments on NLVR2, COCO, Flickr30k, and VQA show substantial GFLOPs reductions (e.g., $80\%$ on BLIP/NLVR2 with less than $4\%$ performance degradation) at competitive performance, and ablations validate the contributions of MAG and DTP. MADTP shows strong cross-modal compression capability and practical potential for deploying VLTs in computationally constrained settings.

Abstract

Vision-Language Transformers (VLTs) have shown great success recently, but they incur heavy computation costs, largely because of the large number of visual and language tokens. Existing token pruning research for compressing VLTs mainly follows a single-modality-based scheme and ignores the critical role of aligning different modalities to guide the token pruning process, so tokens that are important for one modality can be falsely pruned in the other modality branch. Meanwhile, existing VLT pruning works also lack the flexibility to dynamically compress each layer based on different input samples. To this end, we propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs. Specifically, we first introduce a well-designed Multi-modality Alignment Guidance (MAG) module that aligns features of the same semantic concept from different modalities, to ensure that the pruned tokens are less important for all modalities. We further design a novel Dynamic Token Pruning (DTP) module, which adaptively adjusts the token compression ratio in each layer based on different input instances. Extensive experiments on various benchmarks demonstrate that MADTP significantly reduces the computational complexity of various multimodal models while preserving competitive performance. Notably, when applied to the BLIP model on the NLVR2 dataset, MADTP can reduce the GFLOPs by 80% with less than 4% performance degradation.
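
The idea of threshold-based, instance-adaptive pruning can be illustrated with a minimal sketch. It is not the authors' implementation: the importance scores below are made-up numbers standing in for the composite Token Importance Score, the threshold is a fixed constant standing in for the learnable one, and the helper names `prune_tokens`, `density`, and `combined_loss` are hypothetical. The point is only that a shared threshold keeps a different number of tokens for different inputs, and that the training objective has the form $L = L_{task} + \alpha L_{sim}$.

```python
from typing import List


def prune_tokens(scores: List[float], threshold: float, min_keep: int = 1) -> List[int]:
    """Return the indices of tokens whose score is at least `threshold`.

    Assumption: if fewer than `min_keep` tokens pass, fall back to the
    `min_keep` highest-scoring tokens so a layer never ends up empty.
    """
    keep = [i for i, s in enumerate(scores) if s >= threshold]
    if len(keep) < min_keep:
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        keep = sorted(ranked[:min_keep])
    return keep


def density(num_kept: int, num_original: int) -> float:
    """Ratio of retained tokens to the original token count."""
    return num_kept / num_original


def combined_loss(task_loss: float, sim_loss: float, alpha: float) -> float:
    """Combined objective of the form L = L_task + alpha * L_sim."""
    return task_loss + alpha * sim_loss


# Illustrative (invented) per-token scores for two inputs of the same length.
easy_scores = [0.9, 0.1, 0.05, 0.02, 0.8, 0.03]  # importance concentrated in few tokens
hard_scores = [0.6, 0.5, 0.7, 0.4, 0.55, 0.1]    # importance spread over many tokens
threshold = 0.3                                   # stand-in for the learned threshold

for name, scores in [("easy", easy_scores), ("hard", hard_scores)]:
    kept = prune_tokens(scores, threshold)
    print(name, kept, round(density(len(kept), len(scores)), 2))
```

With a static pruning ratio both inputs would lose the same fraction of tokens. Here the same threshold retains two of six tokens for the first input and five of six for the second, which is the kind of per-instance variation in retained-token density that the dynamic scheme is meant to allow.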

Paper Structure (32 sections, 12 equations, 9 figures, 17 tables)

Introduction
Related Work
Vision-Language Transformer
Multimodal Compression
Token Merging and Pruning
Methodology
Preliminaries
Multi-modality Alignment Guidance
Dynamic Token Pruning
Objective Function
Experiments
Experimental Setup
Experiments on the Visual Reasoning Task
Experiments on the Retrieval Task
Experiments on the Image Caption Task
...and 17 more sections

Figures (9)

Figure 1: Comparison between our MADTP and other compression methods for the BLIP model tested on the NLVR2 dataset. STP represents the Static Token Pruning method, and MAG denotes our Multi-modality Alignment Guidance module.
Figure 1: Visualization comparisons of token pruning results between STP and MADTP, providing strong evidence that our approach emphasizes modality correlation, effectively avoids pruning crucial tokens and dynamically adjusts pruning ratio according to inputs.
Figure 2: Overview of the proposed MADTP framework. It comprises two main components: the Multi-modality Alignment Guidance (MAG) module and the Dynamic Token Pruning (DTP) module. The MAG module is placed between the vision and language branches in VLTs, facilitating explicit alignment of representations across modalities and offering guidance for token pruning. Meanwhile, the DTP module is incorporated within each transformer block, allowing for dynamic token pruning based on the complexity of input instances.
Figure 2: Comparisons of MADTP token pruning in each transformer block for samples of different instance complexity levels, including Easy, Middle, and Hard samples. The density represents the ratio of retained tokens to the total number of original tokens.
Figure 3: Visualization of token pruning results between STP and MADTP, providing strong evidence that our approach emphasizes modality correlation and effectively avoids pruning crucial tokens.
...and 4 more figures
