Deep Correlated Prompting for Visual Recognition with Missing Modalities

Lianyu Hu, Tongkai Shi, Wei Feng, Fanhua Shang, Liang Wan

TL;DR

This work uses prompt learning to adapt large pretrained multimodal models to missing-modality scenarios by treating each missing case as a distinct type of input, and it incorporates the complementary semantics of the modalities to guide the prompt design for each modality.

Abstract

Large-scale multimodal models have shown excellent performance across a range of tasks, powered by large corpora of paired multimodal training data. They are generally assumed to receive modality-complete inputs. However, this assumption may not hold in the real world because of privacy constraints or collection difficulty, and models pretrained on modality-complete data readily degrade on missing-modality cases. To handle this issue, we turn to prompt learning to adapt large pretrained multimodal models to missing-modality scenarios by treating different missing cases as different types of input. Instead of only prepending independent prompts to the intermediate layers, we propose to leverage the correlations between prompts and input features and to exploit the relationships between prompts at different layers when designing the instructions. We also incorporate the complementary semantics of different modalities to guide the prompt design for each modality. Extensive experiments on three commonly used datasets consistently demonstrate the superiority of our method over previous approaches under different missing scenarios. Extensive ablations further show the generalizability and reliability of our method across different modality-missing ratios and types.

Paper Structure

This paper contains 16 sections, 8 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview of our proposed framework. We first select the prompts $P^T_m$ and $P^I_m$ with $m\in\{c, m_1, m_2\}$ for the text encoder and image encoder according to the missing case (e.g., complete, text-only, image-only) of the multimodal inputs ($x^{m_1}$, $x^{m_2}$). The prompt $P^T_m$ ($P^I_m$) is composed of three types of missing-aware prompts: the correlated prompts $P_m^{T,R}$ ($P_m^{I,R}$), the dynamic prompts $P_m^{T,D}$ ($P_m^{I,D}$) and the modal-common prompts $P_m^{T,C}$ ($P_m^{I,C}$). We then prepend the prompts to the inputs and intermediate features of both encoders to instruct the model to fit the missing case. Finally, we concatenate the task-related tokens of both encoders as the final representation and pass it through a fully-connected layer for class prediction. Throughout the procedure, only the fully-connected (fc) layer and the deep correlated prompts are updated, while all other parameters are kept frozen.
  • Figure 2: (1) Baseline, which simply uses a fixed image encoder and text encoder and only finetunes the classifier to handle downstream tasks. (2) MMP, which inserts independent prompts at each layer to guide the model to handle missing-modality cases. (3) Correlated prompts, which generate the prompts of the next layer from the prompts of both modalities in the current layer, enabling the prompts of the two modalities to cooperate. (4) Dynamic prompts, which are computed dynamically from the input features to better guide the behavior of the model, avoiding the use of fixed prompts for different inputs. (5) Modal-common prompts, which store the information shared across modalities and help the model encode modal-specific information to better handle the missing scenarios in each modality.
  • Figure 3: Comparison of our final model (Ours) with (1) the baseline, which directly drops the features when a modality is missing; (2) Ours (A), which is equipped only with the correlated prompts; (3) Ours (B), which is equipped with both the correlated prompts and the dynamic prompts. The experiments are conducted on the val set of the MM-IMDb dataset (Arevalo et al., 2017) across different missing rates (0–100%) under three missing-modality scenarios (missing-both, missing-image and missing-text).
  • Figure 4: Ablations for the correlated prompts.
  • Figure 5: Ablations for the configurations of dynamic prompts.
  • ...and 3 more figures
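
The pipeline described in the Figure 1 caption (select prompts by missing case, prepend them, concatenate the task tokens of both encoders, classify with a fully-connected layer) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `missing_case`, `make_prompts`, `frozen_encoder` and `forward` are hypothetical, the encoder is a mean-pooling stand-in for a frozen pretrained encoder, a missing modality is replaced by a single zero token, and prompts are prepended only at the input rather than at every intermediate layer. The decomposition into correlated, dynamic and modal-common prompts is not modeled here.

```python
import random

# The three missing cases named in the Figure 1 caption.
CASES = ("complete", "text_only", "image_only")

def missing_case(text_tokens, image_tokens):
    """Map which inputs are present to one of the three cases."""
    if text_tokens is not None and image_tokens is not None:
        return "complete"
    if text_tokens is not None:
        return "text_only"
    if image_tokens is not None:
        return "image_only"
    raise ValueError("at least one modality must be present")

def make_prompts(num_prompts, dim, rng):
    """One bank of prompt vectors per missing case.
    Random here; these would be the trainable parameters in practice."""
    return {
        case: [[rng.gauss(0.0, 0.02) for _ in range(dim)]
               for _ in range(num_prompts)]
        for case in CASES
    }

def frozen_encoder(tokens):
    """Stand-in for a frozen encoder (assumption): the mean of all
    tokens plays the role of the task-related token."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]

def forward(text_tokens, image_tokens, text_prompts, image_prompts,
            fc_weight, dim):
    case = missing_case(text_tokens, image_tokens)
    # Assumption: a missing modality becomes a single zero placeholder token.
    if text_tokens is None:
        text_tokens = [[0.0] * dim]
    if image_tokens is None:
        image_tokens = [[0.0] * dim]
    # Prepend the case-specific prompts to each encoder's input sequence.
    text_feat = frozen_encoder(text_prompts[case] + text_tokens)
    image_feat = frozen_encoder(image_prompts[case] + image_tokens)
    # Concatenate both task tokens and apply the fully-connected layer.
    fused = text_feat + image_feat
    logits = [sum(w * x for w, x in zip(row, fused)) for row in fc_weight]
    return case, logits
```

In a real setup only the prompt banks and `fc_weight` would receive gradient updates, matching the caption's statement that the encoders stay frozen.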