Multi-Modal Continual Learning via Cross-Modality Adapters and Representation Alignment with Knowledge Preservation
Evelyn Chee, Wynne Hsu, Mong Li Lee
TL;DR
This work tackles multi-modal continual learning by integrating pre-trained modality-specific transformers through cross-modality adapters built as a mixture-of-experts. A representation alignment loss, coupled with a preservation term, ensures robust, modality-aware representations while mitigating forgetting across tasks. The MMEncoder selectively activates experts to model cross-modal interactions and freezes influential experts to preserve prior knowledge, with a joint objective combining classification, distillation, alignment, and preservation losses. Empirical results across AVE, UESTC-MMEA, and SAMSEMO demonstrate superior accuracy and reduced forgetting compared to strong baselines, including other PTM-based approaches. The framework advances scalable, cross-modal continual learning with practical implications for systems that must adapt to new tasks and modalities over time.
Abstract
Continual learning is essential for adapting models to new tasks while retaining previously acquired knowledge. While existing approaches predominantly focus on uni-modal data, multi-modal learning offers substantial benefits by utilizing diverse sensory inputs, akin to human perception. However, multi-modal continual learning presents additional challenges, as the model must effectively integrate new information from various modalities while preventing catastrophic forgetting. In this work, we propose a pre-trained model-based framework for multi-modal continual learning. Our framework includes a novel cross-modality adapter with a mixture-of-experts structure to facilitate effective integration of multi-modal information across tasks. We also introduce a representation alignment loss that fosters learning of robust multi-modal representations, and regularize relationships between learned representations to preserve knowledge from previous tasks. Experiments on several multi-modal datasets demonstrate that our approach consistently outperforms baselines in both class-incremental and domain-incremental learning, achieving higher accuracy and reduced forgetting.
