CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, Jiaqi Wang

TL;DR

This work tackles the growing computational burden of vision-language Transformers by introducing CrossGET, a general token-ensemble acceleration framework that adaptively reduces token counts through bidirectional cross-modal guidance. It combines two core innovations: cross-guided matching and ensemble, which inject learnable cross tokens and compute cross-modal importance to guide token reduction, and complete-graph soft matching, a non-iterative, parallelizable algorithm that reliably matches tokens by considering all pairwise similarities. A dedicated loss, $\mathcal{L}_{JS}$, aligns cross-token representations across modalities while detaching projection layers to preserve existing training dynamics. Empirically, CrossGET delivers substantial speedups across modality-independent models (e.g., CLIP) and modality-dependent systems (e.g., BLIP, BLIP2, LLaVA) on tasks including image-text retrieval, visual reasoning, image captioning, and visual question answering, with minor or no degradation in task performance. The approach is complementary to other acceleration techniques and demonstrates strong practical impact for deploying multimodal transformers more efficiently.

Abstract

Recent vision-language models have achieved tremendous advances. However, their computational costs are also escalating dramatically, making model acceleration exceedingly critical. To pursue more efficient vision-language Transformers, this paper introduces Cross-Guided Ensemble of Tokens (CrossGET), a general acceleration framework for vision-language Transformers. This framework adaptively combines tokens in real-time during inference, significantly reducing computational costs while maintaining high performance. CrossGET features two primary innovations: 1) Cross-Guided Matching and Ensemble. CrossGET leverages cross-modal guided token matching and ensemble to effectively utilize cross-modal information, achieving wider applicability across both modality-independent models, e.g., CLIP, and modality-dependent ones, e.g., BLIP2. 2) Complete-Graph Soft Matching. CrossGET introduces an algorithm for the token-matching mechanism, ensuring reliable matching results while facilitating parallelizability and high efficiency. Extensive experiments have been conducted on various vision-language tasks, such as image-text retrieval, visual reasoning, image captioning, and visual question answering. The performance on both classic multimodal architectures and emerging multimodal LLMs demonstrates the framework's effectiveness and versatility. The code is available at https://github.com/sdc17/CrossGET.

Paper Structure

This paper contains 60 sections, 36 equations, 5 figures, 30 tables, and 2 algorithms.

Figures (5)

  • Figure 1: Overview of CrossGET. ① CrossGET is a general multimodal token reduction framework that applies to both modality-independent and modality-dependent models. ② CrossGET jointly considers the token similarity derived from intra-modal complete-graph soft matching and the token importance indicated by cross-modal guidance to determine which tokens should be combined. The cross-modal importance is subsequently used to weight tokens within each stack and output their ensembles. ③ Compared with the original models, CrossGET achieves considerable computation savings and acceleration with negligible performance degradation.
  • Figure 2: Diagram of introducing and leveraging cross-modal guidance for vision-language Transformers. ① Cross tokens learn cross-modal information by reducing the post-projection distance between cross tokens of different modalities. The switches indicate that token reduction can be freely enabled or disabled for each modality and layer. ② Cross tokens provide cross-modal importance as a metric to guide token matching. ③ The metric also guides the weighted summation of the stacked tokens to produce token ensemble results.
  • Figure 3: Illustration of complete-graph soft matching on two examples. Case 2 is an inverted version of Case 1: the similarity between each token pair in Case 2 equals $1 - s$, where $s$ is the similarity of the corresponding pair in Case 1.
  • Figure 4: Performance-cost tradeoffs in three situations: 1) The left subfigure illustrates the tradeoff for BLIP on the NLVR2 dataset of the Visual Reasoning task without training. 2) The upper-right subfigure illustrates the tradeoff for BLIP on the NLVR2 dataset of the Visual Reasoning task with training. 3) The lower-right subfigure illustrates the tradeoff for CLIP on the Flickr30K dataset of the Image-Text Retrieval task, where models are trained with $50\%$ of tokens reduced and then re-evaluated under other token reduction ratios without further training.
  • Figure 5: Diagram of adding cross tokens to modality-independent models such as CLIP (Radford et al., 2021) (left) and modality-dependent models such as BLIP/BLIP2 (Li et al., 2022; Li et al., 2023) (right).
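
The captions for Figures 1 to 3 describe three ingredients: similarities computed over all token pairs, an importance score derived from a cross token, and an importance-weighted ensemble of the matched tokens. The NumPy sketch below shows one simplified way these pieces could fit together. It is not the authors' algorithm or code. The function name `reduce_tokens`, the selection score (best similarity minus importance), the use of cosine similarity to the cross token as importance, and the exponential weighting are all assumptions made for illustration; the official repository should be consulted for the actual method.

```python
import numpy as np

def cosine_sim(a, b, eps=1e-12):
    """Cosine similarity between every row of a and every row of b."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return a @ b.T

def reduce_tokens(tokens, cross_token, r):
    """Merge r tokens into their most similar remaining tokens (hypothetical sketch).

    tokens:      (n, d) token features of one modality
    cross_token: (d,)   cross token carrying cross-modal guidance
    r:           number of tokens to remove, 0 <= r < n
    """
    n = tokens.shape[0]
    assert 0 <= r < n

    # All pairwise similarities (a complete graph over tokens); ignore self-matches.
    sim = cosine_sim(tokens, tokens)
    np.fill_diagonal(sim, -np.inf)

    # Assumption: importance is the cosine similarity to the cross token.
    importance = cosine_sim(tokens, cross_token[None, :])[:, 0]

    # Assumption: prefer to merge tokens that are redundant and unimportant.
    score = sim.max(axis=1) - importance
    src = np.argsort(-score)[:r]                 # tokens to be merged away
    keep = np.setdiff1d(np.arange(n), src)       # surviving tokens, sorted

    # Each source is matched to its most similar surviving token, in one shot.
    dst = keep[np.argmax(sim[np.ix_(src, keep)], axis=1)]
    pos = np.searchsorted(keep, dst)             # position of each destination in `keep`

    # Importance-weighted ensemble of each destination with its matched sources.
    w = np.exp(importance)
    num = (tokens * w[:, None])[keep].copy()
    den = w[keep].copy()
    np.add.at(num, pos, tokens[src] * w[src][:, None])
    np.add.at(den, pos, w[src])
    return num / den[:, None]

# Usage: two near-duplicate tokens and one distinct token, remove one token.
tokens = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
cross_token = np.array([0.0, 1.0])
print(reduce_tokens(tokens, cross_token, r=1).shape)  # (2, 2)
```

In this sketch the matching step is a single masked argmax over the similarity matrix, so it needs no iteration and parallelizes trivially. Restricting destinations to surviving tokens is a design choice made here to avoid chains in which a token is both merged away and merged into.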