Beyond Unimodal Learning: The Importance of Integrating Multiple Modalities for Lifelong Learning
Fahad Sarfraz, Bahram Zonooz, Elahe Arani
TL;DR
This work tackles catastrophic forgetting in continual learning by arguing that integrating multiple modalities, as the brain does, yields more robust representations and better knowledge retention. It introduces the Multimodal Continual Learning (MMCL) benchmark, built on a subset of VGGSound, to evaluate class-incremental, domain-incremental, and generalized class-incremental learning on both unimodal and multimodal data. The proposed SAMM method combines rehearsal with structure-aware alignment and dynamic inference to fuse visual and audio signals while preserving modality-specific information, guided by losses for task performance, consistency, and relational alignment. Empirical results show that SAMM improves both stability and plasticity and reduces task recency bias relative to unimodal baselines and standard experience replay (ER). The method establishes a strong baseline for multimodal continual learning on realistic benchmarks and points to extensions to additional modalities such as language.
Abstract
While humans excel at continual learning (CL), deep neural networks (DNNs) exhibit catastrophic forgetting. A salient feature of the brain that allows effective CL is that it utilizes multiple modalities for learning and inference, which is underexplored in DNNs. Therefore, we study the role and interactions of multiple modalities in mitigating forgetting and introduce a benchmark for multimodal continual learning. Our findings demonstrate that leveraging multiple views and complementary information from multiple modalities enables the model to learn more accurate and robust representations. This makes the model less vulnerable to modality-specific regularities and considerably mitigates forgetting. Furthermore, we observe that individual modalities exhibit varying degrees of robustness to distribution shift. Finally, we propose a method for integrating and aligning the information from different modalities by utilizing the relational structural similarities between the data points in each modality. Our method sets a strong baseline that enables both single- and multimodal inference. Our study provides a promising case for further exploring the role of multiple modalities in enabling CL and provides a standard benchmark for future research.
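To make the idea of aligning modalities through "relational structural similarities between the data points" more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the relational structure of a batch is summarized by its pairwise cosine-similarity matrix and that the alignment penalty is the mean squared difference between the audio and visual matrices. The abstract specifies neither choice, and the function names `pairwise_cosine` and `relational_alignment_loss` are hypothetical.

```python
import numpy as np


def pairwise_cosine(features):
    """Return the (batch x batch) cosine-similarity matrix of the row vectors."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # Guard against division by zero for all-zero feature vectors.
    unit = features / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def relational_alignment_loss(audio_feats, visual_feats):
    """Mean squared difference between the two modalities' similarity matrices.

    Assumption (not from the paper): cosine similarity plus an MSE penalty.
    The two feature matrices need the same batch size but may have
    different feature dimensions, since only sample-to-sample relations
    are compared.
    """
    sim_audio = pairwise_cosine(audio_feats)
    sim_visual = pairwise_cosine(visual_feats)
    return float(np.mean((sim_audio - sim_visual) ** 2))


# Usage: a batch of 6 samples with 8-d audio and 16-d visual features.
rng = np.random.default_rng(0)
audio = rng.normal(size=(6, 8))
visual = rng.normal(size=(6, 16))
loss = relational_alignment_loss(audio, visual)
```

Because the penalty compares relations between samples rather than the features themselves, each modality can keep its own feature space, which fits the stated goal of preserving modality-specific information while still supporting single-modality inference.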
