MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Yongzhu Miao, Shasha Li, Jintao Tang, Ting Wang

TL;DR

MuDPT tackles the misalignment problem caused by uni-modal prompt tuning in vision-language models by introducing multi-modal deep-symphysis prompt tuning with an Injection Model that enables cross-modality attention and hierarchical fusion. The approach symmetrically injects deep textual and visual prompts and learns a lightweight, modality-agnostic transformer to fuse prompts across modalities, while keeping the CLIP backbones frozen. Empirical results across 11 datasets show improved few-shot visual recognition and strong cross-dataset generalization, with notable gains over CoOp and CoCoOp, indicating more robust alignment between textual and visual representations. The work highlights practical benefits for adapting VL-PTMs to downstream tasks and suggests future work to further close the gap with hand-crafted prompts in zero-shot settings.

Abstract

Prompt tuning methods such as CoOp have recently shown promising visual recognition and transfer learning ability on various downstream tasks, following the emergence of large pre-trained vision-language models such as CLIP. However, we find that existing uni-modal prompt tuning approaches may yield sub-optimal performance, because the uni-modal design breaks the original alignment of textual and visual representations in the pre-trained model. Inspired by the nature of pre-trained vision-language models, we aim to achieve completeness in prompt tuning and propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed MuDPT, which extends independent multi-modal prompt tuning by additionally learning a model-agnostic transformative network that allows deep hierarchical bi-directional prompt fusion. We evaluate MuDPT on few-shot visual recognition and out-of-domain generalization tasks. Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin, thanks to the synergistic alignment of textual and visual representations. Our code is available at: https://github.com/Mechrev0/MuDPT.

Paper Structure (11 sections, 11 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Comparison between our approach (MuDPT) and existing prompt tuning approaches (text prompt tuning and visual prompt tuning).
  • Figure 2: Overview of CLIP.
  • Figure 3: Overview of our approach: MuDPT (Multi-modal Deep-symphysis Prompt Tuning). MuDPT introduces textual and visual prompts to the text and image encoder. During training, only the parameters of prompts (drawn in pink blocks) are tuned while the backbone (drawn in blue blocks) is frozen. MuDPT realizes cross-modality prompt transformation and fusion by further learning a light Injection Model.
  • Figure 4: Overview of our Injection Model, which consists of a multi-head attention block to calculate cross-modality attention and a linear layer to adapt the dimension of prompts.
  • Figure 5: Base-to-new generalization results. We compare MuDPT with zero-shot CLIP, CoOp, and CoCoOp (the current SOTA). On base classes, MuDPT outperforms CoCoOp on all 11 datasets. On new classes, MuDPT outperforms CoOp and CoCoOp on every dataset except DTD, where its accuracy is 1.13% lower than CoCoOp's.
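
The Figure 4 caption describes the Injection Model as a multi-head attention block that computes cross-modality attention, plus a linear layer that adapts the prompt dimension. The NumPy sketch below is one way to read that description, not the authors' implementation. Several things in it are assumptions for illustration: the class name `InjectionSketch`, the ordering (linear adaptation first, then attention), the random untrained weights, the prompt length of 4, and the widths 512 (text) and 768 (visual). Queries come from the prompts being updated, while keys and values come from the other modality's prompts.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class InjectionSketch:
    """Hypothetical stand-in for the Injection Model (not the authors' code)."""

    def __init__(self, dim_target, dim_source, num_heads, seed=0):
        assert dim_target % num_heads == 0, "dim_target must split evenly across heads"
        rng = np.random.default_rng(seed)
        self.num_heads = num_heads
        self.head_dim = dim_target // num_heads
        # Linear layer: adapts source-modality prompts to the target width.
        self.w_adapt = rng.normal(0.0, 0.02, (dim_source, dim_target))
        self.b_adapt = np.zeros(dim_target)
        # Projections of the multi-head attention block (random, untrained).
        self.w_q = rng.normal(0.0, 0.02, (dim_target, dim_target))
        self.w_k = rng.normal(0.0, 0.02, (dim_target, dim_target))
        self.w_v = rng.normal(0.0, 0.02, (dim_target, dim_target))
        self.w_o = rng.normal(0.0, 0.02, (dim_target, dim_target))

    def _split_heads(self, x):
        # (n, dim) -> (heads, n, head_dim)
        n = x.shape[0]
        return x.reshape(n, self.num_heads, self.head_dim).transpose(1, 0, 2)

    def __call__(self, target_prompts, source_prompts):
        # Assumed order: adapt the dimension first, then attend across modalities.
        adapted = source_prompts @ self.w_adapt + self.b_adapt
        q = self._split_heads(target_prompts @ self.w_q)
        k = self._split_heads(adapted @ self.w_k)
        v = self._split_heads(adapted @ self.w_v)
        # Scaled dot-product scores: (heads, n_target, n_source).
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(self.head_dim)
        weights = softmax(scores, axis=-1)
        heads = weights @ v
        # Merge heads back to (n_target, dim_target) and apply the output projection.
        merged = heads.transpose(1, 0, 2).reshape(target_prompts.shape[0], -1)
        return merged @ self.w_o, weights


rng = np.random.default_rng(1)
text_prompts = rng.normal(size=(4, 512))    # assumed: 4 prompt tokens, width 512
visual_prompts = rng.normal(size=(4, 768))  # assumed: 4 prompt tokens, width 768

# One module per direction, mirroring the bi-directional fusion in the abstract.
visual_to_text = InjectionSketch(dim_target=512, dim_source=768, num_heads=8)
text_to_visual = InjectionSketch(dim_target=768, dim_source=512, num_heads=8)

fused_text, attn = visual_to_text(text_prompts, visual_prompts)
fused_visual, _ = text_to_visual(visual_prompts, text_prompts)
print(fused_text.shape, fused_visual.shape)
```

Each output keeps the shape of the prompts it updates, so in this reading the fused prompts could be fed back into the corresponding frozen encoder without changing its input width. How the fused output is combined with the original prompts (for example, through a residual connection) is not stated in the caption and is left out of the sketch.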