MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models
Yongzhu Miao, Shasha Li, Jintao Tang, Ting Wang
TL;DR
MuDPT tackles the misalignment problem caused by uni-modal prompt tuning in vision-language models by introducing multi-modal deep-symphysis prompt tuning with an Injection Model that enables cross-modality attention and hierarchical fusion. The approach symmetrically injects deep textual and visual prompts and learns a lightweight, modality-agnostic transformer to fuse prompts across modalities, while keeping the CLIP backbones frozen. Empirical results across 11 datasets show improved few-shot visual recognition and strong cross-dataset generalization, with notable gains over CoOp and CoCoOp, indicating more robust alignment between textual and visual representations. The work highlights practical benefits for adapting VL-PTMs to downstream tasks and suggests future work to further close the gap with hand-crafted prompts in zero-shot settings.
Abstract
Prompt tuning, like CoOp, has recently shown promising vision recognizing and transfer learning ability on various downstream tasks with the emergence of large pre-trained vision-language models like CLIP. However, we identify that existing uni-modal prompt tuning approaches may result in sub-optimal performance since this uni-modal design breaks the original alignment of textual and visual representations in the pre-trained model. Inspired by the nature of pre-trained vision-language models, we aim to achieve completeness in prompt tuning and propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed as MuDPT, which extends independent multi-modal prompt tuning by additionally learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion. We evaluate the effectiveness of MuDPT on few-shot vision recognition and out-of-domain generalization tasks. Compared with the state-of-the-art methods, MuDPT achieves better recognition and generalization ability with an apparent margin thanks to synergistic alignment of textual and visual representations. Our code is available at: https://github.com/Mechrev0/MuDPT.
