Continual Cross-Modal Generalization
Yan Xia, Hai Huang, Minghui Fang, Zhou Zhao
TL;DR
Building a unified multimodal representation normally requires abundant paired data across many modalities, which is impractical to collect. This work proposes COMET, a continual cross-modal framework that combines a Continual Mixture of Experts Adapter (CMoE-Adapter) with a Pseudo-Modality Replay (PMR) mechanism to incrementally map new modalities into an expanding shared discrete codebook. A mediator modality serves as the bridge between learning stages, and memory-preserving objectives protect what was learned for earlier modalities. In pre-training and downstream evaluations on video-text, audio-text, image-text, speech-text, and related tasks, COMET achieves strong zero-shot cross-modal generalization, with significant gains on unseen data pairs and robust transfer to seen pairs. Dynamic codebook expansion lets the approach scale to diverse multimodal data, and ablations highlight the importance of the PMR and MoE components.
Abstract
Cross-modal generalization aims to learn a shared discrete representation space from multimodal pairs, enabling knowledge transfer across unannotated modalities. However, achieving a unified representation for all modality pairs requires extensive paired data, which is often impractical. Inspired by the availability of abundant bimodal data (e.g., in ImageBind), we explore a continual learning approach that incrementally maps new modalities into a shared discrete codebook via a mediator modality. We propose the Continual Mixture of Experts Adapter (CMoE-Adapter) to project diverse modalities into a unified space while preserving prior knowledge. To align semantics across stages, we introduce a Pseudo-Modality Replay (PMR) mechanism with a dynamically expanding codebook, enabling the model to adaptively incorporate new modalities using learned ones as guidance. Extensive experiments on image-text, audio-text, video-text, and speech-text data show that our method achieves strong performance on various cross-modal generalization tasks. Code is provided in the supplementary material.
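To make the idea of a shared discrete codebook that grows as modalities are added more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that features from any modality are already projected into a common vector space, that quantization is a nearest-neighbour lookup, and that the codebook expands whenever no existing code lies within a distance threshold. The class `SharedCodebook`, the method `quantize_or_expand`, and the `threshold` rule are all hypothetical names and choices introduced for illustration; the abstract does not specify the expansion criterion.

```python
import math


class SharedCodebook:
    """Toy shared discrete codebook (illustrative assumption, not COMET itself)."""

    def __init__(self, codes):
        # Each code is a vector in the shared space; copy to avoid aliasing.
        self.codes = [list(c) for c in codes]

    def quantize(self, vec):
        """Return (index, distance) of the code closest to vec."""
        dists = [math.dist(vec, c) for c in self.codes]
        idx = min(range(len(dists)), key=dists.__getitem__)
        return idx, dists[idx]

    def quantize_or_expand(self, vec, threshold):
        """Reuse the nearest code if it is within threshold; otherwise
        append vec as a new code (hypothetical expansion rule)."""
        idx, dist = self.quantize(vec)
        if dist <= threshold:
            return idx
        self.codes.append(list(vec))
        return len(self.codes) - 1


# Two existing codes, e.g. learned in an earlier stage.
codebook = SharedCodebook([[0.0, 0.0], [1.0, 1.0]])

# Features from two different modalities that land near the same code
# are mapped to the same discrete index.
text_code = codebook.quantize_or_expand([0.9, 1.1], threshold=0.5)
audio_code = codebook.quantize_or_expand([0.95, 1.05], threshold=0.5)

# A feature far from every existing code triggers an expansion.
new_code = codebook.quantize_or_expand([5.0, 5.0], threshold=0.5)
```

In this toy setting, semantically matching inputs from different modalities share an index, which is what allows a downstream model trained on one modality's codes to be applied to another. The mediator modality, the CMoE-Adapter, and PMR described in the abstract are not modelled here.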
