MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality

Ruiting Dai; Yuqiao Tan; Lisi Mo; Tao He; Ke Qin; Shuang Liang

TL;DR

This paper proposes a novel Multi-step Adaptive Prompt Learning (MuAP) framework, aiming to generate multimodal prompts and perform multi-step prompt tuning, which adaptively learns knowledge by iteratively aligning modalities.

Abstract

Recently, prompt learning has garnered considerable attention for its success in various Vision-Language (VL) tasks. However, existing prompt-based models focus primarily on prompt generation and prompt strategies under complete-modality settings, which do not accurately reflect real-world scenarios where partial modality information may be missing. In this paper, we present the first comprehensive investigation into prompt learning behavior when modalities are incomplete, revealing the high sensitivity of prompt-based models to missing modalities. To this end, we propose a novel Multi-step Adaptive Prompt Learning (MuAP) framework, aiming to generate multimodal prompts and perform multi-step prompt tuning, which adaptively learns knowledge by iteratively aligning modalities. Specifically, we generate multimodal prompts for each modality and devise prompt strategies to integrate them into the Transformer model. We then perform prompt tuning sequentially, with single-stage tuning followed by alignment-stage tuning, allowing each modality prompt to be learned autonomously and adaptively. This mitigates the imbalance in previous works, where only the textual prompts are learnable. Extensive experiments demonstrate the effectiveness of MuAP, which achieves significant improvements over the state of the art on all benchmark datasets.

Paper Structure (37 sections, 12 equations, 7 figures, 5 tables)

Introduction
Related work
Vision-Language Pre-trained Model
Prompt Learning for Vision-Language Tasks
Method
Problem Definition
Overall Framework
Revisiting ViLT
Multimodal Prompt Generator
Prompt Strategy Design
Head-fusion Prompting.
Cross-fusion Prompting.
Multi-step Prompt Tuning
Single-stage prompt tuning.
Alignment-stage prompt tuning.
...and 22 more sections

Figures (7)

Figure 1: Various architectures in the prompt tuning field. (a) The CLIP-family method MaPLe focuses on prompt generation with complete modality information. (b) The missing-aware prompts method in MPVR has $2^C-1$ prompts to represent all missing scenarios, where $C$ is the number of modalities. (c) Our method aims to enhance parameter efficiency by using only $C$ prompts and to improve robustness in missing scenarios through multi-step prompt tuning.
Figure 2: The overview of our MuAP framework. The Multimodal Prompt Generator initially generates complete-type prompts, $P_{m_t}$ and $P_{m_v}$, tailored to the specific modality case (e.g., textual or visual modalities in Vision-Language tasks). Next, it employs $f_\mathsf{missing}$ to create missing-type prompts $\tilde{P}_{m_t}$ and $\tilde{P}_{m_v}$. The Prompt Strategy Design module integrates prompts into multiple MSA layers using various strategies (i.e., head fusion or cross fusion). During the training phase, we leverage Multi-step Prompt Tuning to synchronize distinct characteristics of different modality prompts effectively.
Figure 3: Comparison of baselines on the Hateful Memes dataset with different missing rates across various missing-modality scenarios. Each point in the picture represents training and testing with the same $\epsilon\%$ missing rate.
Figure 4: Ablation study on prompt length for the head-fusion strategy. All models are trained and evaluated on various scenarios (e.g., missing-image) with $\epsilon = 70\%$.
Figure 5: Detailed examples for three benchmark datasets.
...and 2 more figures
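
The prompt counts contrasted in the Figure 1 caption can be made concrete with a small sketch. It assumes that the $2^C-1$ figure corresponds to one prompt per non-empty subset of available modalities, which is an interpretation of the caption and is not spelled out there. The function names are hypothetical and do not come from the paper.

```python
from itertools import combinations


def missing_aware_cases(modalities):
    """Enumerate every non-empty subset of available modalities.

    Assumption: a missing-aware scheme keeps one prompt per subset,
    which yields 2^C - 1 prompts for C modalities.
    """
    cases = []
    for size in range(1, len(modalities) + 1):
        cases.extend(combinations(modalities, size))
    return cases


def per_modality_prompt_count(modalities):
    """One prompt per modality, i.e. C prompts, as in panel (c)."""
    return len(modalities)


if __name__ == "__main__":
    for mods in (["text", "image"], ["text", "image", "audio"]):
        cases = missing_aware_cases(mods)
        print(len(mods), "modalities:",
              len(cases), "subset prompts vs",
              per_modality_prompt_count(mods), "per-modality prompts")
```

For the two-modality Vision-Language case the gap is small (3 versus 2 prompts), but the subset count grows exponentially with $C$ while the per-modality count grows linearly.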
