BMIP: Bi-directional Modality Interaction Prompt Learning for VLM
Song-Lin Lv, Yu-Yang Chen, Zhi Zhou, Ming Yang, Lan-Zhe Guo
TL;DR
BMIP tackles the limitation of single-modal and uni-directional prompt learning in vision-language models by introducing deep language and deep vision prompts plus a bi-directional interaction module that uses attention-derived weights to adapt prompts across modalities. The method employs learnable projection heads and a dynamic aggregation mechanism to fuse cross-modal information, with Corollary 1 supporting improved trainability and alignment. An open-world generalization evaluation paradigm is introduced to provide a more realistic assessment, and BMIP achieves state-of-the-art performance across base-to-novel, cross-dataset transfer, and domain generalization on 11–15 benchmarks, while remaining compatible with other prompt methods such as MaPLe, PromptSRC, and CoPrompt. The findings demonstrate BMIP’s ability to handle imbalanced text-image datasets and enhance multi-modal consistency, offering a robust foundation for future multi-modal prompting research.
Abstract
Vision-language models (VLMs) have exhibited remarkable generalization capabilities, and prompt learning for VLMs has attracted great attention for its ability to adapt pre-trained VLMs to specific downstream tasks. However, existing studies mainly focus on single-modal prompts or uni-directional modality interaction, overlooking the powerful alignment effects that arise from interaction between the vision and language modalities. To this end, we propose a novel prompt learning method called Bi-directional Modality Interaction Prompt (BMIP), which dynamically weights bi-modal information by learning from the information in the attention layer, enhancing trainability and inter-modal consistency compared to simple information aggregation methods. To evaluate the effectiveness of prompt learning methods, we propose a more realistic evaluation paradigm called open-world generalization, which complements the widely adopted cross-dataset transfer and domain generalization tasks. Comprehensive experiments on various datasets reveal that BMIP not only outperforms current state-of-the-art methods across all three evaluation paradigms but is also flexible enough to be combined with other prompt-based methods for consistent performance enhancement.
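
To make the idea of bi-directional, dynamically weighted prompt interaction concrete, the following is a minimal sketch and not the authors' implementation. It assumes each modality holds the same number of prompt tokens, that a projection head is a plain linear map, and that the per-token weight is a sigmoid of a scalar score standing in for whatever signal is derived from the attention layer. The names `linear`, `aggregate`, `v2l`, `l2v`, and `attn_scores` are hypothetical and chosen only for illustration.

```python
import math


def linear(x, weight):
    """Apply a projection head; `weight` has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def aggregate(own, other, other_to_own, attn_scores):
    """Mix each prompt token with the projected token of the other modality.

    own:          prompts of this modality (list of vectors)
    other:        prompts of the other modality (same number of tokens)
    other_to_own: projection matrix from the other modality's dim to ours
    attn_scores:  one scalar per token; an ASSUMED stand-in for the
                  attention-derived signal described in the abstract
    """
    mixed = []
    for p_own, p_other, score in zip(own, other, attn_scores):
        w = sigmoid(score)  # per-token weight in (0, 1)
        projected = linear(p_other, other_to_own)
        mixed.append([w * a + (1.0 - w) * b for a, b in zip(p_own, projected)])
    return mixed


# Toy prompts: 2 language tokens (dim 2) and 2 vision tokens (dim 3).
lang_prompts = [[1.0, 0.0], [0.0, 1.0]]
vis_prompts = [[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]]

# Hypothetical projection heads (fixed here; learnable in practice).
v2l = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # vision (3) -> language (2)
l2v = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # language (2) -> vision (3)

# Both directions read the ORIGINAL prompts, so neither update depends
# on the other having been applied first.
new_lang = aggregate(lang_prompts, vis_prompts, v2l, attn_scores=[0.0, 0.0])
new_vis = aggregate(vis_prompts, lang_prompts, l2v, attn_scores=[0.0, 0.0])

print(new_lang)
print(new_vis)
```

With a score of zero every weight is 0.5, so each updated prompt is the plain average of its own token and the projected token from the other modality. Larger scores keep more of the modality's own prompt, while smaller scores let more cross-modal information through.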
