SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding

Liangtao Shi, Ting Liu, Xiantao Hu, Yue Hu, Quanjun Yin, Richang Hong

TL;DR

SwimVG tackles visual grounding by replacing bulky vision–language transformer stacks with a lightweight, parameter-efficient strategy based on step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA). By freezing the vision and language backbones and injecting prompts and adapters from shallow to deep layers, SwimVG achieves progressive token-level fusion and cross-modal interaction with a small tunable budget. On RefCOCO, RefCOCO+, RefCOCOg, and Flickr30K Entities, it delivers state-of-the-art accuracy while remaining efficient: only about $2.04\%$ of the parameters are tuned, and inference is roughly $40\%$ faster. Ablation studies show complementary benefits from Swip and CIA, with domain adapters further enhancing text representations for VG. The approach holds promise for scalable, efficient multimodal grounding and could extend to related tasks such as VQA and video captioning.

Abstract

Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. Most existing methods transfer visual/linguistic knowledge separately by fully fine-tuning uni-modal pre-trained models, followed by a simple stack of visual-language transformers for multimodal fusion. However, these approaches not only limit adequate interaction between visual and linguistic contexts, but also incur significant computational costs. To address these issues, we explore a step-wise multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG proposes step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding, replacing the cumbersome transformer stacks for multimodal fusion. Swip can improve the alignment between the vision and language representations step by step, in a token-level fusion manner. In addition, weight-level CIA further promotes multimodal fusion by cross-modal interaction. Swip and CIA are both parameter-efficient paradigms, and they fuse the cross-modal features gradually from shallow to deep layers. Experimental results on four widely used benchmarks demonstrate that SwimVG achieves remarkable abilities and considerable benefits in terms of efficiency. Our code is available at https://github.com/liuting20/SwimVG.
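
To give a feel for why freezing the backbones and training only small inserted modules leads to a tiny tunable budget, the sketch below counts parameters for a generic bottleneck adapter (a down-projection followed by an up-projection) added to every layer of a frozen encoder. This is not the authors' Swip or CIA implementation; the backbone size, layer count, hidden width, and bottleneck width are all hypothetical values chosen for illustration, and the function names are invented here.

```python
# Illustrative sketch only: a generic bottleneck-adapter parameter budget.
# It does NOT reproduce SwimVG's Swip or CIA modules; all sizes are hypothetical.


def linear_params(d_in: int, d_out: int) -> int:
    """Parameter count of a dense layer with a bias term."""
    return d_in * d_out + d_out


def adapter_params(d_model: int, d_bottleneck: int) -> int:
    """A down-projection to the bottleneck followed by an up-projection back."""
    return linear_params(d_model, d_bottleneck) + linear_params(d_bottleneck, d_model)


def tunable_fraction(frozen: int, tunable: int) -> float:
    """Share of all parameters that receive gradient updates."""
    return tunable / (frozen + tunable)


# Hypothetical configuration (assumed, not taken from the paper).
FROZEN_BACKBONE = 86_000_000  # parameters kept frozen
NUM_LAYERS = 12               # one adapter per encoder layer
D_MODEL = 768                 # hidden width of the encoder
D_BOTTLENECK = 64             # adapter bottleneck width

tunable = NUM_LAYERS * adapter_params(D_MODEL, D_BOTTLENECK)
fraction = tunable_fraction(FROZEN_BACKBONE, tunable)
print(f"tunable parameters: {tunable:,} ({fraction:.2%} of the total)")
```

Under these assumed sizes the adapters account for only a small percentage of the total, which is the same order of magnitude as the tunable share reported for SwimVG; the exact figure depends on the real module design and backbone sizes.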

Paper Structure

This paper contains 21 sections, 11 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Comparison of multimodal fusion strategies between (a) the mainstream framework and (b) SwimVG (ours) for visual grounding. Freezing the pre-trained models and only updating the tiny modules in SwimVG reduces the updated parameters by 97.96% while achieving even stronger performance.
  • Figure 2: Overall architecture of the proposed SwimVG, which freezes the pre-trained vision encoder and language encoder. SwimVG integrates step-wise multimodal prompts (Swip) and cross-modal interactive adapters, which bridge the visual and language encoders, ensuring the visual encoder concentrates on the text-relevant areas.
  • Figure 3: The domain-specific adapter and cross-modal interactive adapter.
  • Figure 4: Visualizations of attention maps, prediction results (yellow bounding boxes) and ground truth (red bounding boxes).
  • Figure 5: The convergence comparison between SwimVG and other SOTA models on RefCOCO.
  • ...and 2 more figures