BMIP: Bi-directional Modality Interaction Prompt Learning for VLM

Song-Lin Lv, Yu-Yang Chen, Zhi Zhou, Ming Yang, Lan-Zhe Guo

TL;DR

BMIP tackles the limitation of single-modal and uni-directional prompt learning in vision-language models by introducing deep language and deep vision prompts plus a bi-directional interaction module that uses attention-derived weights to adapt prompts across modalities. The method employs learnable projection heads and a dynamic aggregation mechanism to fuse cross-modal information, with Corollary 1 supporting improved trainability and alignment. An open-world generalization evaluation paradigm is introduced to provide a more realistic assessment, and BMIP achieves state-of-the-art performance across base-to-novel, cross-dataset transfer, and domain generalization on 11–15 benchmarks, while remaining compatible with other prompt methods such as MaPLe, PromptSRC, and CoPrompt. The findings demonstrate BMIP’s ability to handle imbalanced text-image datasets and enhance multi-modal consistency, offering a robust foundation for future multi-modal prompting research.

Abstract

Vision-language models (VLMs) have exhibited remarkable generalization capabilities, and prompt learning for VLMs has attracted great attention for the ability to adapt pre-trained VLMs to specific downstream tasks. However, existing studies mainly focus on single-modal prompts or uni-directional modality interaction, overlooking the powerful alignment effects resulting from the interaction between the vision and language modalities. To this end, we propose a novel prompt learning method called Bi-directional Modality Interaction Prompt (BMIP), which dynamically weights bi-modal information through learning the information of the attention layer, enhancing trainability and inter-modal consistency compared to simple information aggregation methods. To evaluate the effectiveness of prompt learning methods, we propose a more realistic evaluation paradigm called open-world generalization complementing the widely adopted cross-dataset transfer and domain generalization tasks. Comprehensive experiments on various datasets reveal that BMIP not only outperforms current state-of-the-art methods across all three evaluation paradigms but is also flexible enough to be combined with other prompt-based methods for consistent performance enhancement.

Paper Structure (16 sections, 13 equations, 2 figures, 5 tables)

This paper contains 16 sections, 13 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: (a) Adapting language modality representations for downstream tasks through language prompt learning. (b) Adapting vision modality representations for downstream tasks through vision prompt learning. (c) Achieving uni-directional modality interaction by converting language prompts into vision prompts. (d) Achieving bi-directional modality interaction by aligning two modalities' representations through aggregating information between the vision and language modalities. (Icons in the figure mark which components are learnable and which are frozen during training.)
  • Figure 2: Overview of proposed BMIP method. BMIP finetunes the layered prompts for the vision and language branches while freezing the rest of the model parameters. It utilizes an aggregation function to modulate cross-modal prompt influences, thus enabling effective information exchange between the two modalities.
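
The aggregation described in Figure 2 can be illustrated with a small sketch. This is not the paper's formulation: the function names (`project`, `mix`, `bidirectional_step`), the per-token attention scores, and the weighting rule `w = a_own / (a_own + a_in)` are all assumptions chosen for illustration. The sketch only shows the general idea of projecting each modality's prompts into the other modality's space and mixing them with weights derived from attention.

```python
# Hypothetical sketch of one bi-directional prompt interaction step.
# All names and the weighting rule are illustrative assumptions,
# not the BMIP authors' implementation.

def project(prompts, weight):
    """Linear projection head: (n x d_in) prompts times a (d_in x d_out) matrix."""
    return [[sum(p[i] * weight[i][j] for i in range(len(p)))
             for j in range(len(weight[0]))]
            for p in prompts]

def mix(own, incoming, own_attn, incoming_attn):
    """Per-token convex combination of a modality's own prompts with the
    projected prompts from the other modality.

    own_attn / incoming_attn are assumed to be positive scalars, one per
    prompt token (e.g. the attention mass that token receives)."""
    mixed = []
    for o, inc, a_own, a_in in zip(own, incoming, own_attn, incoming_attn):
        w = a_own / (a_own + a_in)  # assumed weighting rule, lies in (0, 1)
        mixed.append([w * x + (1.0 - w) * y for x, y in zip(o, inc)])
    return mixed

def bidirectional_step(lang, vis, lang_attn, vis_attn, v2l, l2v):
    """Update both prompt sets from the same pre-update values, so the
    exchange is symmetric rather than language-to-vision only."""
    new_lang = mix(lang, project(vis, v2l), lang_attn, vis_attn)
    new_vis = mix(vis, project(lang, l2v), vis_attn, lang_attn)
    return new_lang, new_vis

# Toy example: one prompt token per modality, language dim 3, vision dim 4.
lang_prompts = [[1.0, 0.0, 0.0]]
vis_prompts = [[0.0, 2.0, 0.0, 0.0]]
vision_to_lang = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
lang_to_vision = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]

updated_lang, updated_vis = bidirectional_step(
    lang_prompts, vis_prompts, [0.5], [0.5], vision_to_lang, lang_to_vision
)
print(updated_lang, updated_vis)
```

In a real setting the prompts and projection matrices would be the only trainable tensors while the encoders stay frozen, as the Figure 2 caption states; here they are fixed lists so the arithmetic is easy to follow by hand.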