Efficient Prompting for Continual Adaptation to Missing Modalities

Zirun Guo, Shulei Wang, Wang Lin, Weicai Yan, Yangyang Wu, Tao Jin

TL;DR

This work addresses missing modalities in continual multimodal learning by framing the problem as domain-incremental learning and introducing three prompt types (modality-specific, task-aware, and task-specific) together with a contrastive task-interaction strategy. The prompts support intra- and inter-modality as well as intra- and inter-task feature learning while adding only about 2–3% of the backbone's parameters, and the method achieves strong exemplar-free performance on CMU-MOSI, IEMOCAP, and CH-SIMS. A contrastive loss aligns task-aware prompts across related missing-modality tasks, mitigating catastrophic forgetting as data arrives sequentially. Extensive ablations validate each prompt type, their arrangement, and the overall objective, pointing to practical gains for real-world multimodal systems that must cope with missing modalities over time.

Abstract

Missing modality issues are common in real-world applications, arising from factors such as equipment failures and privacy concerns. When fine-tuning pre-trained models on downstream datasets with missing modalities, performance can degrade significantly. Current methods often aggregate various missing cases to train recovery modules or align multimodal features, resulting in suboptimal performance, high computational costs, and the risk of catastrophic forgetting in continual environments where data arrives sequentially. In this paper, we formulate the dynamic missing modality problem as a continual learning task and introduce the continual multimodal missing modality task. To address this challenge efficiently, we introduce three types of prompts: modality-specific, task-aware, and task-specific prompts. These prompts enable the model to learn intra-modality, inter-modality, intra-task, and inter-task features. Furthermore, we propose a contrastive task interaction strategy to explicitly learn prompts correlating different modalities. We conduct extensive experiments on three public datasets, where our method consistently outperforms state-of-the-art approaches.
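
The contrastive task interaction idea can be illustrated with a generic InfoNCE-style loss over prompt vectors. This is a minimal sketch under stated assumptions, not the paper's objective: the exact loss is not given in this summary, and the function names, the temperature, the toy vectors, and the choice of which prompts count as "related" are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: -log(exp(s_pos/t) / sum_k exp(s_k/t)).

    Assumption: similarity is cosine and the temperature is 0.1;
    neither is taken from the paper.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Hypothetical task-aware prompt vectors for three missing-modality tasks.
p_anchor = [1.0, 0.0, 0.5]
p_related = [0.9, 0.1, 0.4]          # assumed to come from a related task
p_unrelated = [[-1.0, 0.2, 0.0], [0.0, -1.0, 0.3]]

# Pairing the anchor with the related prompt gives a low loss;
# pairing it with an unrelated prompt gives a high loss.
loss_aligned = info_nce(p_anchor, p_related, p_unrelated)
loss_misaligned = info_nce(p_anchor, p_unrelated[0], [p_related, p_unrelated[1]])
print(loss_aligned < loss_misaligned)
```

Minimizing such a loss pulls prompts of related tasks together and pushes unrelated ones apart, which is the general mechanism a contrastive alignment of prompts relies on.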

Paper Structure

This paper contains 14 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The difference between existing methods and ours. Existing methods train all cases of data together, which is infeasible in many real-world scenarios. In contrast, we formulate it as a continual learning problem, which is much closer to real situations.
  • Figure 2: The performance of existing methods degrades when they are applied to the continual multimodal missing modality task.
  • Figure 3: The overall architecture of our proposed method. After the projection layer, modality-specific prompts, task-aware prompts and task-specific prompts are attached to multi-head self-attention (MSA) layers sequentially. Task-aware prompts are generated from modality-specific prompts and missing keys using Eq. (\ref{e2}).
  • Figure 4: t-SNE visualization of task-aware prompts on the CMU-MOSI dataset. Each point represents a prompt vector. Tasks 1-7 are shown in Table \ref{denotation}.
  • Figure 5: Quantitative results on the CMU-MOSI dataset with different prompt lengths $\ell$.
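
The prompt attachment described in the Figure 3 caption, and the prompt length $\ell$ varied in Figure 5, can be sketched as prefix-style prompting. This is a rough illustration, not the authors' implementation: it assumes that "sequentially" means successive layers each receive one prompt type, and the sizes, the helper names `attach_prompt` and `make_prompt`, and the constant initial values are hypothetical. The rule that generates task-aware prompts from modality-specific prompts and missing keys is not reproduced here.

```python
def attach_prompt(tokens, prompt):
    """Prepend prompt vectors to a token sequence before an attention layer."""
    return [list(p) for p in prompt] + [list(t) for t in tokens]

def make_prompt(length, dim, value):
    """Create a prompt of `length` vectors of size `dim` (placeholder values)."""
    return [[value] * dim for _ in range(length)]

# Hypothetical sizes: embedding dim, input sequence length, prompt length (ell).
dim, seq_len, prompt_len = 4, 6, 2
tokens = [[0.0] * dim for _ in range(seq_len)]

prompts = {
    "modality_specific": make_prompt(prompt_len, dim, 0.1),
    "task_aware": make_prompt(prompt_len, dim, 0.2),
    "task_specific": make_prompt(prompt_len, dim, 0.3),
}

# Assumption: one prompt type per successive layer, in this order.
layer_plan = ["modality_specific", "task_aware", "task_specific"]

lengths = []
for name in layer_plan:
    layer_input = attach_prompt(tokens, prompts[name])
    lengths.append(len(layer_input))
    # A real MSA layer would process layer_input here; this sketch only
    # tracks how the prompt changes the sequence length seen by the layer.

# Only the prompts would be trained; the backbone stays frozen.
trainable = sum(len(p) * dim for p in prompts.values())
print(lengths, trainable)
```

Each layer sees `seq_len + prompt_len` positions, and the trainable parameter count grows linearly with the prompt length, which is why the choice of $\ell$ trades accuracy against overhead.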