Continual Cross-Modal Generalization

Yan Xia, Hai Huang, Minghui Fang, Zhou Zhao

TL;DR

This work tackles the challenge of building a unified multimodal representation when collecting abundant paired data across many modalities is impractical. It proposes COMET, a continual cross-modal framework that combines a Continual Mixture of Experts Adapter (CMoE-Adapter) with a Pseudo-Modality Replay (PMR) mechanism to incrementally map new modalities into an expanding shared discrete codebook. Using a mediator modality as a bridge and enforcing memory-preserving objectives, COMET achieves strong zero-shot cross-modal generalization across video-text, audio-text, image-text, speech-text, and related tasks, validated through pre-training and downstream evaluations. The approach offers scalable, flexible learning for diverse multimodal data, with significant performance gains on unseen data pairs and robust transfer to seen pairs, aided by dynamic codebook expansion. Ablations highlight the importance of the PMR and MoE components.

Abstract

Cross-modal generalization aims to learn a shared discrete representation space from multimodal pairs, enabling knowledge transfer across unannotated modalities. However, achieving a unified representation for all modality pairs requires extensive paired data, which is often impractical. Inspired by the availability of abundant bimodal data (e.g., in ImageBind), we explore a continual learning approach that incrementally maps new modalities into a shared discrete codebook via a mediator modality. We propose the Continual Mixture of Experts Adapter (CMoE-Adapter) to project diverse modalities into a unified space while preserving prior knowledge. To align semantics across stages, we introduce a Pseudo-Modality Replay (PMR) mechanism with a dynamically expanding codebook, enabling the model to adaptively incorporate new modalities using learned ones as guidance. Extensive experiments on image-text, audio-text, video-text, and speech-text show that our method achieves strong performance on various cross-modal generalization tasks. Code is provided in the supplementary material.
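
To make the idea of a shared discrete codebook concrete, the following is a minimal sketch and not the paper's implementation. It assumes a plain nearest-neighbour (Euclidean) lookup. The names `nearest_code`, `shared_codebook`, `text_feature`, and `audio_feature` are hypothetical, and the vectors are toy values chosen for illustration. The point is only that features from different modalities describing the same concept should resolve to the same code index.

```python
import math


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared Euclidean)."""
    best_index, best_dist = -1, math.inf
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Hypothetical 2-D codebook shared by every modality (toy values).
shared_codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Hypothetical encoder outputs for the same concept in two modalities.
text_feature = [0.9, 1.1]
audio_feature = [1.2, 0.8]

text_code = nearest_code(text_feature, shared_codebook)
audio_code = nearest_code(audio_feature, shared_codebook)
print(text_code, audio_code)
```

When both features land on the same index, a downstream model trained on codes from one modality can be applied to the other, which is the sense in which the abstract speaks of transfer across unannotated modalities.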

Paper Structure

This paper contains 23 sections, 6 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Overview of the proposed continual unified multimodal representation framework, using two stages as an example. The codebook obtained in the previous phase is replicated for the new phase, extended, and continuously updated during subsequent training. During the PMR process, however, the codes acquired in the previous phase remain unchanged.
  • Figure 2: Ablation on the number of experts
  • Figure 3: Visualization of discrete codes. Red indicates codes effectively activated by all three modalities (video, audio, and text), green represents codes effectively activated by two modalities, blue denotes codes effectively activated by one modality, and black signifies codes that are not effectively activated.
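
The codebook handling described in the Figure 1 caption (replicate, extend, keep earlier codes fixed during PMR) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the class `ExpandingCodebook`, its methods, the per-code frozen flag, and the interpolation update rule are all hypothetical stand-ins for whatever the paper actually uses.

```python
import random


class ExpandingCodebook:
    """Toy codebook that can grow while protecting earlier codes (hypothetical)."""

    def __init__(self, codes):
        # Copy the codes from the previous phase rather than aliasing them.
        self.codes = [list(code) for code in codes]
        self.frozen = [False] * len(self.codes)

    def freeze_existing(self):
        """Mark every current code as fixed, as assumed for the PMR step."""
        self.frozen = [True] * len(self.codes)

    def expand(self, num_new, dim, rng):
        """Append `num_new` randomly initialised, trainable codes."""
        for _ in range(num_new):
            self.codes.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
            self.frozen.append(False)

    def update(self, index, target, rate=0.5):
        """Move a code toward `target`; return False if the code is frozen."""
        if self.frozen[index]:
            return False
        self.codes[index] = [
            c + rate * (t - c) for c, t in zip(self.codes[index], target)
        ]
        return True


rng = random.Random(0)
codebook = ExpandingCodebook([[0.0, 0.0], [1.0, 1.0]])  # codes from phase 1
codebook.freeze_existing()                  # protect phase-1 codes
codebook.expand(num_new=2, dim=2, rng=rng)  # add capacity for the new modality

old_before = list(codebook.codes[0])
moved_old = codebook.update(0, [5.0, 5.0])  # frozen code: no change
moved_new = codebook.update(2, [5.0, 5.0])  # new code: moves toward target
print(moved_old, moved_new, len(codebook.codes))
```

In this sketch the frozen flag would be set only while the replay step runs. Outside that step the caption indicates the whole codebook continues to be updated, which would correspond to clearing the flags.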