Reconstruct before Query: Continual Missing Modality Learning with Decomposed Prompt Collaboration

Shu Zhao, Xiaohan Zou, Tan Yu, Huijuan Xu

TL;DR

This paper introduces Continual Missing Modality Learning (CMML), a novel task that investigates how models generalize when data of certain modalities is missing during continual fine-tuning, and proposes Reconstruct before Query (RebQ), a framework that effectively reconstructs the missing modality information and retains pre-trained knowledge.

Abstract

Pre-trained large multi-modal models (LMMs) rely on fine-tuning to adapt to diverse user applications. However, fine-tuning can be hampered by deactivated sensors (e.g., cameras turned off for privacy or technical issues), which yield modality-incomplete data and create a mismatch between the training data and the data seen at inference. Additionally, continual training leads to catastrophic forgetting, diluting the knowledge in pre-trained LMMs. To overcome these challenges, we introduce a novel task, Continual Missing Modality Learning (CMML), to investigate how models can generalize when data of certain modalities is missing during continual fine-tuning. Our preliminary benchmarks reveal that existing methods suffer a significant performance drop in CMML, even with the aid of advanced continual learning techniques. We therefore devise a framework termed Reconstruct before Query (RebQ). It decomposes prompts into modality-specific ones and breaks them into components stored in pools that are accessed via a key-query mechanism, which facilitates Parameter-Efficient Fine-Tuning and enhances knowledge transferability for subsequent tasks. Meanwhile, RebQ leverages the extensive multi-modal knowledge of pre-trained LMMs to reconstruct the data of the missing modality. Comprehensive experiments demonstrate that RebQ effectively reconstructs the missing modality information and retains pre-trained knowledge. Specifically, compared with the baseline, RebQ improves average precision from 20.00 to 50.92 and decreases average forgetting from 75.95 to 8.56. Code and datasets are available at https://github.com/Tree-Shu-Zhao/RebQ.pytorch
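
To make the key-query idea in the abstract more concrete, the following is a minimal sketch of a pool of prompt components, where each component is paired with a key and a query vector decides how the components are mixed. It is an illustration under stated assumptions, not the authors' implementation: the class name `PromptPool`, the cosine-similarity weighting, the random initialization, and all dimensions are hypothetical choices that do not come from the paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


class PromptPool:
    """Hypothetical pool of prompt components, each paired with a key.

    Assumption: components are combined by a similarity-weighted sum.
    The weighting scheme used by RebQ itself may differ.
    """

    def __init__(self, pool_size, query_dim, prompt_len, embed_dim, seed=0):
        rng = random.Random(seed)
        # One key per component; keys live in the same space as the query.
        self.keys = [
            [rng.gauss(0, 1) for _ in range(query_dim)]
            for _ in range(pool_size)
        ]
        # Each component is a (prompt_len x embed_dim) block of prompt tokens.
        self.components = [
            [[rng.gauss(0, 1) for _ in range(embed_dim)]
             for _ in range(prompt_len)]
            for _ in range(pool_size)
        ]

    def weights(self, query):
        """Similarity of the query to every key in the pool."""
        return [cosine(query, key) for key in self.keys]

    def assemble(self, query):
        """Build one prompt as the weighted sum of all components."""
        w = self.weights(query)
        prompt_len = len(self.components[0])
        embed_dim = len(self.components[0][0])
        return [
            [sum(w[m] * self.components[m][t][d] for m in range(len(w)))
             for d in range(embed_dim)]
            for t in range(prompt_len)
        ]


# Usage: a toy 8-dimensional query selects a 2-token prompt.
pool = PromptPool(pool_size=4, query_dim=8, prompt_len=2, embed_dim=8)
prompt = pool.assemble([1.0] * 8)
```

In a parameter-efficient setup of this kind, only the keys and components would be trained while the backbone stays frozen; in this sketch they are simply random values.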

Paper Structure

This paper contains 16 sections, 14 equations, 5 figures, 8 tables, 2 algorithms.

Figures (5)

  • Figure 1: Large Multi-Modal Models (LMMs) pre-trained on a huge amount of multi-modal data often necessitate subsequent Parameter-Efficient Fine-Tuning (PEFT) to adapt to diverse user applications. However, deactivated sensors (e.g., cameras turned off for privacy or technical issues) yield modality-incomplete data, significantly challenging the fine-tuning process.
  • Figure 2: Pipeline of the RebQ framework (the text modality is unavailable in this example). The parameters of the LMM are frozen, and prompt learning enables parameter-efficient fine-tuning on user devices. When a modality is missing, the available modality is used to generate memory prompts from a Memory pool via a key-query mechanism, and these memory prompts are inserted into the LMM to reconstruct the missing modality as its modality-specific query. The two modality-specific queries are then used to generate modality-specific prompts from a text pool (Folder) and a visual pool (Album), which are used to learn downstream tasks.
  • Figure 3: Results of each incremental stage on UPMC-Food101-CMML. $\eta$ is $70\%$.
  • Figure 4: Prompt designs. (a) The pool size of the memory/image/text pools. (b) The layers into which prompts are inserted. (c) The prompt length.
  • Figure 5: t-SNE visualization of reconstruction. Each point represents a query embedding of dimension $768$.
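
The control flow described in the Figure 2 caption (reconstruct the missing modality's query first, then query the modality-specific pools) can be sketched as follows. This is a rough illustration only: `build_queries` and `toy_reconstruct` are hypothetical names, and `toy_reconstruct` is a trivial placeholder standing in for the step where memory prompts and the frozen LMM produce the missing query. It does not reflect the actual model.

```python
from typing import Callable, Optional, Sequence, Tuple

Vector = Sequence[float]


def build_queries(
    image_query: Optional[Vector],
    text_query: Optional[Vector],
    reconstruct: Callable[[Vector, str], Vector],
) -> Tuple[Vector, Vector]:
    """Return (image_query, text_query), reconstructing the missing one.

    A missing modality is passed as None. The `reconstruct` callable is an
    assumption of this sketch: it receives the available query and the name
    of the missing modality, and returns a stand-in query for it.
    """
    if image_query is None and text_query is None:
        raise ValueError("at least one modality must be available")
    if text_query is None:
        text_query = reconstruct(image_query, "text")
    elif image_query is None:
        image_query = reconstruct(text_query, "image")
    return image_query, text_query


def toy_reconstruct(available: Vector, missing: str) -> Vector:
    """Placeholder reconstruction; NOT the paper's memory-prompt mechanism."""
    sign = 1.0 if missing == "text" else -1.0
    return [sign * x for x in available]


# Usage: the text modality is missing, so its query is reconstructed
# from the image query before any modality-specific pool is queried.
img_q, txt_q = build_queries([0.5, 2.0], None, toy_reconstruct)
```

Once both queries exist, each would be used to select prompts from its own pool (the text pool and the visual pool in the caption), so the downstream prompt selection does not need a separate code path for incomplete samples.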