SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding
Liangtao Shi, Ting Liu, Xiantao Hu, Yue Hu, Quanjun Yin, Richang Hong
TL;DR
SwimVG tackles visual grounding by replacing bulky vision–language transformer stacks with a lightweight, parameter-efficient strategy built on step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA). The vision and language backbones stay frozen, and prompts and adapters are injected from shallow to deep layers, so SwimVG fuses the two modalities progressively at the token level with only a small tunable budget. On RefCOCO, RefCOCO+, RefCOCOg, and Flickr30K Entities, it delivers state-of-the-art accuracy while being markedly more efficient: tunable parameters drop to about $2.04\%$ and inference is roughly $40\%$ faster. Ablation studies show that Swip and CIA provide complementary benefits, and that domain adapters further improve the text representations used for visual grounding. The approach is promising for scalable, efficient multimodal grounding and could extend to related tasks such as VQA and video captioning.
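To make the parameter-efficiency claim concrete, the following minimal sketch counts the tunable share of a model whose backbone is frozen and whose only trainable parts are small adapters. It assumes a generic bottleneck adapter (a down-projection followed by an up-projection), which may differ from the actual CIA design, and every size in it is hypothetical rather than taken from the paper.

```python
def bottleneck_adapter_params(d_model: int, d_bottleneck: int) -> int:
    """Weights and biases of a down-projection followed by an up-projection."""
    down = d_model * d_bottleneck + d_bottleneck
    up = d_bottleneck * d_model + d_model
    return down + up


def tunable_fraction(frozen_params: int, tunable_params: int) -> float:
    """Share of all parameters that receive gradient updates."""
    return tunable_params / (frozen_params + tunable_params)


# Hypothetical sizes chosen for illustration only (not from the paper).
frozen = 200_000_000                     # frozen vision + language backbones
per_adapter = bottleneck_adapter_params(d_model=768, d_bottleneck=64)
tunable = 12 * per_adapter               # assume one adapter in each of 12 layers
fraction = tunable_fraction(frozen, tunable)

print(f"{per_adapter} parameters per adapter, {fraction:.2%} of the model is tunable")
```

Because the adapter cost grows with `d_model * d_bottleneck` while the backbone is untouched, the tunable share stays small even when adapters are placed in every layer.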
Abstract
Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. Most existing methods transfer visual and linguistic knowledge separately by fully fine-tuning uni-modal pre-trained models, followed by a simple stack of visual-language transformers for multimodal fusion. However, these approaches not only limit the interaction between visual and linguistic contexts, but also incur significant computational costs. To address these issues, we explore a step-wise multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG introduces step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding, replacing the cumbersome transformer stacks for multimodal fusion. Swip improves the alignment between the vision and language representations step by step, in a token-level fusion manner. In addition, the weight-level CIA further promotes multimodal fusion through cross-modal interaction. Swip and CIA are both parameter-efficient paradigms, and they gradually fuse the cross-modal features from shallow to deep layers. Experimental results on four widely used benchmarks demonstrate that SwimVG achieves strong grounding performance together with considerable efficiency gains. Our code is available at https://github.com/liuting20/SwimVG.
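The token-level, shallow-to-deep fusion described for Swip can be illustrated with a toy sketch. It is not the authors' implementation: the frozen layer is a hypothetical stand-in that mixes tokens with uniform weights, the prompt is simply a scaled mean of the text tokens, and the names `frozen_layer`, `stepwise_fuse`, and `prompt_scales` are invented for illustration. The weight-level CIA is not modelled here.

```python
from typing import List

Vector = List[float]


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average a sequence of token vectors into a single vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def frozen_layer(tokens: List[Vector]) -> List[Vector]:
    """Hypothetical frozen layer: each token is averaged with the sequence mean.

    This uniform mixing stands in for self-attention; it has no tunable weights.
    """
    mean = mean_pool(tokens)
    return [[0.5 * (x + m) for x, m in zip(token, mean)] for token in tokens]


def stepwise_fuse(
    visual: List[Vector], text: List[Vector], prompt_scales: List[float]
) -> List[Vector]:
    """Append a text-derived prompt token before every frozen layer.

    prompt_scales holds one tunable scalar per layer (an assumption made for
    this sketch); it is the only trainable quantity.
    """
    text_summary = mean_pool(text)
    tokens = visual
    for scale in prompt_scales:
        prompt = [scale * x for x in text_summary]
        # Mix visual tokens with the prompt, then drop the prompt token again.
        tokens = frozen_layer(tokens + [prompt])[: len(visual)]
    return tokens


visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[2.0, 2.0]]
fused = stepwise_fuse(visual_tokens, text_tokens, prompt_scales=[1.0, 1.0])
unfused = stepwise_fuse(visual_tokens, text_tokens, prompt_scales=[0.0, 0.0])
```

Here `fused` and `unfused` differ only through the per-layer prompts, which shows how language information can reach the visual tokens at every depth while the backbone itself stays fixed.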
