MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen
TL;DR
MADTP is a joint framework for accelerating Vision-Language Transformers (VLTs) that combines a Multi-modality Alignment Guidance (MAG) module with a Dynamic Token Pruning (DTP) module. MAG aligns visual and textual representations using learnable tokens to produce cross-modal attention maps that guide token pruning, while DTP performs instance- and layer-wise adaptive pruning based on a composite Token Importance Score and a learnable threshold. The framework is trained with a combined objective $L = L_{task} + \alpha L_{sim}$, where $L_{sim}$ enforces cross-modal alignment and pruning effectiveness and $\alpha$ weights that term against the task loss. Experiments on NLVR2, COCO, Flickr30k, and VQA show substantial GFLOPs reductions (e.g., $80\%$ for BLIP on NLVR2 with less than $4\%$ performance degradation), and ablations validate the contributions of MAG and DTP. These results suggest that MADTP is a practical option for deploying VLTs in computationally constrained settings.
Abstract
Vision-Language Transformers (VLTs) have shown great success recently, but they come with heavy computation costs, a major cause of which is the large number of visual and language tokens. Existing token pruning research for compressing VLTs mainly follows a single-modality-based scheme and ignores the critical role of aligning different modalities to guide the token pruning process, so tokens that are important for one modality can be falsely pruned in the other modality branch. Existing VLT pruning works also lack the flexibility to dynamically compress each layer based on different input samples. To this end, we propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs. Specifically, we first introduce a Multi-modality Alignment Guidance (MAG) module that aligns features of the same semantic concept from different modalities, to ensure that the pruned tokens are less important for all modalities. We further design a Dynamic Token Pruning (DTP) module, which adaptively adjusts the token compression ratio in each layer based on different input instances. Extensive experiments on various benchmarks demonstrate that MADTP significantly reduces the computational complexity of various multimodal models while preserving competitive performance. Notably, when applied to the BLIP model on the NLVR2 dataset, MADTP can reduce the GFLOPs by 80% with less than 4% performance degradation.
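To make the idea of threshold-based dynamic pruning concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each token already has a scalar importance score and that a threshold has already been learned; how MADTP computes either quantity is not specified here. The helper names `prune_tokens` and `combined_loss`, the `min_keep` safeguard, and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def prune_tokens(scores: Sequence[float], threshold: float, min_keep: int = 1) -> List[int]:
    """Return indices of tokens whose importance score is at least `threshold`.

    Hypothetical helper: `scores` stands in for a per-token importance score
    and `threshold` for a value that would be learned during training.
    """
    keep = [i for i, s in enumerate(scores) if s >= threshold]
    if len(keep) < min_keep:
        # Assumed safeguard: fall back to the top-scoring tokens so the
        # sequence never becomes empty.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        keep = sorted(ranked[:min_keep])
    return keep


def combined_loss(task_loss: float, sim_loss: float, alpha: float) -> float:
    """Compute L = L_task + alpha * L_sim for scalar loss values."""
    return task_loss + alpha * sim_loss


# Two inputs with different score distributions keep different numbers of
# tokens under the same threshold, so the compression ratio depends on the input.
sample_a = [0.9, 0.1, 0.7, 0.05, 0.6]
sample_b = [0.2, 0.1, 0.8, 0.05, 0.3]
kept_a = prune_tokens(sample_a, threshold=0.5)
kept_b = prune_tokens(sample_b, threshold=0.5)
print(kept_a)  # [0, 2, 4]
print(kept_b)  # [2]
```

Because tokens are kept by comparison against a threshold rather than by a fixed top-k budget, the number of surviving tokens varies per input; applying a separate threshold at each layer would likewise make the ratio vary per layer.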
