Reconstruct before Query: Continual Missing Modality Learning with Decomposed Prompt Collaboration

Shu Zhao; Xiaohan Zou; Tan Yu; Huijuan Xu

TL;DR

The paper introduces Continual Missing Modality Learning (CMML), a task that investigates how models generalize when data of certain modalities is missing during continual fine-tuning, and proposes Reconstruct before Query (RebQ), a framework that reconstructs the missing-modality information and retains pre-trained knowledge.

Abstract

Pre-trained large multi-modal models (LMMs) rely on fine-tuning to adapt to diverse user applications. However, fine-tuning can be hampered by deactivated sensors (e.g., cameras turned off for privacy or technical issues), which yield modality-incomplete data and create an inconsistency between the training data and the data seen at inference. In addition, continual training leads to catastrophic forgetting, diluting the knowledge in pre-trained LMMs. To address these challenges, we introduce a novel task, Continual Missing Modality Learning (CMML), to investigate how models can generalize when data of certain modalities is missing during continual fine-tuning. Our preliminary benchmarks reveal that existing methods suffer a significant performance drop in CMML, even with the aid of advanced continual learning techniques. We therefore devise a framework termed Reconstruct before Query (RebQ). It decomposes prompts into modality-specific ones and breaks them into components stored in pools that are accessed via a key-query mechanism, which facilitates Parameter-Efficient Fine-Tuning (PEFT) and enhances knowledge transferability to subsequent tasks. Meanwhile, RebQ leverages the extensive multi-modal knowledge of pre-trained LMMs to reconstruct the data of the missing modality. Comprehensive experiments demonstrate that RebQ effectively reconstructs the missing modality information and retains pre-trained knowledge. Specifically, compared with the baseline, RebQ improves average precision from 20.00 to 50.92 and decreases average forgetting from 75.95 to 8.56. Code and datasets are available at https://github.com/Tree-Shu-Zhao/RebQ.pytorch

Paper Structure (16 sections, 14 equations, 5 figures, 8 tables, 2 algorithms)

Introduction
Related Work
Missing Modality
Continual Learning
Method
Problem Definition
Modality-Specific Prompt Learning
Missing Query Reconstruction
Multi-modality Prompt Collaboration
Experiments
Main Results
Ablation Studies
Conclusion
Comparison of Different Backbones
Analysis of Reconstruction
...and 1 more sections

Figures (5)

Figure 1: Large Multi-Modal Models (LMMs) pre-trained on a huge amount of multi-modal data often necessitate subsequent Parameter-Efficient Fine-Tuning (PEFT) to adapt to diverse user applications. However, deactivated sensors (e.g., cameras turned off for privacy or technical issues) yield modality-incomplete data, significantly challenging the fine-tuning process.
Figure 2: Pipeline of the RebQ framework (in the figure, the text modality is the unavailable one). The parameters of the LMM are frozen, and prompt learning enables parameter-efficient fine-tuning on user devices. When a modality is missing, the available modality is used to generate memory prompts from a Memory pool via a key-query mechanism, and the memory prompts are inserted into the LMM to reconstruct the missing modality as its modality-specific query. The two modality-specific queries are then used to generate modality-specific prompts from a text pool (Folder) and a visual pool (Album), which are used to learn downstream tasks.
Figure 3: Results of each incremental stage on UPMC-Food101-CMML. $\eta$ is $70\%$.
Figure 4: Prompt designs. (a) The pool size of the memory, image, and text pools. (b) The layers into which prompts are inserted. (c) The prompt length.
Figure 5: t-SNE visualization of Reconstruction. Each point represents a query embedding of dimension $768$.
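
The flow described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class `PromptPool`, the functions `frozen_encoder` and `reconstruct_queries`, the softmax-weighted combination of pool components, and the mean-pooling stand-in for the frozen LMM are all assumptions made for illustration. Only the overall order (reconstruct the missing query from the Memory pool, then query the Folder and Album pools) follows the caption.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class PromptPool:
    """Hypothetical pool of prompt components, each paired with a key."""

    def __init__(self, pool_size, prompt_len, dim, rng):
        self.keys = rng.standard_normal((pool_size, dim))
        self.components = rng.standard_normal((pool_size, prompt_len, dim))

    def lookup(self, query):
        # Key-query matching: cosine similarity between the query and every key.
        sims = l2_normalize(self.keys) @ l2_normalize(query)
        # Assumption: components are mixed with softmax weights
        # (a top-k selection would be another plausible choice).
        weights = np.exp(sims) / np.exp(sims).sum()
        return np.tensordot(weights, self.components, axes=1)  # (prompt_len, dim)


def frozen_encoder(tokens, prompt=None):
    """Stand-in for the frozen LMM: mean-pool the (optionally prompted) tokens."""
    if prompt is not None:
        tokens = np.concatenate([prompt, tokens], axis=0)
    return tokens.mean(axis=0)


def reconstruct_queries(image_tokens, text_tokens, memory_pool):
    """Return one query per modality, reconstructing the missing one if needed."""
    inputs = {"image": image_tokens, "text": text_tokens}
    queries = {m: frozen_encoder(t) for m, t in inputs.items() if t is not None}
    if not queries:
        raise ValueError("at least one modality must be available")
    for missing, available in (("image", "text"), ("text", "image")):
        if missing not in queries:
            # The available modality queries the Memory pool; the memory prompt
            # is inserted into the frozen model to produce the missing query.
            memory_prompt = memory_pool.lookup(queries[available])
            queries[missing] = frozen_encoder(inputs[available], memory_prompt)
    return queries


rng = np.random.default_rng(0)
dim, prompt_len, pool_size = 8, 4, 5
memory = PromptPool(pool_size, prompt_len, dim, rng)
folder = PromptPool(pool_size, prompt_len, dim, rng)  # text pool
album = PromptPool(pool_size, prompt_len, dim, rng)   # visual pool

image_tokens = rng.standard_normal((6, dim))          # text modality is missing
queries = reconstruct_queries(image_tokens, None, memory)
text_prompt = folder.lookup(queries["text"])
image_prompt = album.lookup(queries["image"])
print(text_prompt.shape, image_prompt.shape)
```

In this sketch the reconstruction step always runs before the modality-specific pools are queried, so the Folder and Album lookups receive a query for both modalities even when one input is absent.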
