Beyond Unimodal Learning: The Importance of Integrating Multiple Modalities for Lifelong Learning

Fahad Sarfraz, Bahram Zonooz, Elahe Arani

TL;DR

This work tackles catastrophic forgetting in continual learning by arguing that integrating multiple modalities, as the brain does, can yield robust representations and better knowledge retention. It introduces the Multimodal Continual Learning (MMCL) benchmark based on a VGGSound-derived subset to evaluate class-, domain-, and generalized-class incremental learning across unimodal and multimodal data. The proposed SAMM method combines rehearsal with structure-aware alignment and dynamic inference to fuse visual and audio signals while preserving modality-specific information, guided by losses for task performance, consistency, and relational alignment. Empirical results show that SAMM improves both stability and plasticity and reduces task recency bias relative to unimodal baselines and standard ER, establishing a strong baseline for multimodal continual learning on realistic benchmarks and highlighting avenues for extending to additional modalities such as language.

Abstract

While humans excel at continual learning (CL), deep neural networks (DNNs) exhibit catastrophic forgetting. A salient feature of the brain that allows effective CL is that it utilizes multiple modalities for learning and inference, which is underexplored in DNNs. Therefore, we study the role and interactions of multiple modalities in mitigating forgetting and introduce a benchmark for multimodal continual learning. Our findings demonstrate that leveraging multiple views and complementary information from multiple modalities enables the model to learn more accurate and robust representations. This makes the model less vulnerable to modality-specific regularities and considerably mitigates forgetting. Furthermore, we observe that individual modalities exhibit varying degrees of robustness to distribution shift. Finally, we propose a method for integrating and aligning the information from different modalities by utilizing the relational structural similarities between the data points in each modality. Our method sets a strong baseline that enables both single- and multimodal inference. Our study provides a promising case for further exploring the role of multiple modalities in enabling CL and provides a standard benchmark for future research.
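The idea of aligning modalities through the relational structure between data points can be illustrated with a small sketch. This is not the paper's loss: it assumes cosine similarity as the relation between samples and a mean squared difference between the resulting similarity matrices as the alignment penalty. The function names `cosine_similarity_matrix` and `relational_alignment_loss` and the toy embeddings are hypothetical.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Return the n x n matrix of pairwise cosine similarities for n vectors."""
    eps = 1e-12  # guards against division by zero for all-zero vectors
    norms = [max(math.sqrt(sum(x * x for x in v)), eps) for v in embeddings]
    n = len(embeddings)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def relational_alignment_loss(emb_a, emb_b):
    """Mean squared difference between the similarity structures of two spaces.

    Assumption: both inputs describe the same n samples in the same order.
    The embedding dimensions may differ, because only the n x n similarity
    matrices are compared.
    """
    sim_a = cosine_similarity_matrix(emb_a)
    sim_b = cosine_similarity_matrix(emb_b)
    n = len(sim_a)
    total = sum(
        (sim_a[i][j] - sim_b[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)


# Toy embeddings (hypothetical): two samples per modality.
audio = [[1.0, 0.0], [0.0, 1.0]]                 # samples are dissimilar
visual_same = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # same relational structure
visual_diff = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # samples are identical

loss_same = relational_alignment_loss(audio, visual_same)
loss_diff = relational_alignment_loss(audio, visual_diff)
```

Because only the sample-to-sample similarities are compared, the penalty is zero whenever two spaces arrange the samples in the same way, regardless of embedding size or scale. The same comparison could be applied between each modality-specific space and a fused space.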

Paper Structure

This paper contains 25 sections, 6 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: The Multimodal Continual Learning (MMCL) benchmark is based on a subset of the VGGSound dataset (chosen to make it more accessible to the research community), which covers five super-categories with twenty sub-classes each. The three CL scenarios cover various challenges that a learning agent has to tackle in the real world and correspond to existing unimodal benchmarks.
  • Figure 2: Taskwise performance of models trained with experience replay (buffer size 1000) on multimodal vs. unimodal (audio and visual) data on Seq-VGGSound. As training proceeds through new tasks T (y-axis), performance on previously learned tasks (x-axis) is monitored. Multimodal training not only learns the new task better but also retains more of its performance on earlier tasks.
  • Figure 3: Comparison of experience replay with buffer size 1000 on multimodal data vs. unimodal (audio and visual) data on Seq-VGGSound. (a) shows mean accuracy on all the seen tasks (T) as training progresses, (b) shows the plasticity-stability trade-off of the models, and (c) compares their task recency bias. Learning with multimodal data mitigates forgetting, offers a better stability-plasticity trade-off, and reduces the bias toward recent tasks.
  • Figure 4: The Semantic-Aware Multimodal (SAMM) approach leverages the relational structural information between data points in each modality to integrate and align information from different modalities. Modality-specific representations are learned in the visual and audio encoders, and are then aligned and consolidated so that the modality-specific representations and the fused representation space maintain a similar relational structure. This integration and alignment process enables the consolidation of knowledge across tasks.
  • Figure 5: Grad-CAM activations for different models. Multimodal ER attends more to the regions that are associated with the class labels and localizes the sound regions better. SAMM improves on this considerably.
  • ...and 6 more figures
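
The quantities plotted in Figures 2 and 3(a) can be derived from a taskwise accuracy matrix. The sketch below shows one way to do so, assuming `acc[t][k]` holds the accuracy on task `k` after training on task `t`. The matrix values are made up for illustration and are not results from the paper, and the accuracy-drop measure is a common continual-learning convention rather than a definition taken from the paper.

```python
def mean_seen_accuracy(acc):
    """Mean accuracy over all tasks seen so far, after each training stage.

    acc[t][k] is the accuracy on task k after training on task t (k <= t);
    entries with k > t are ignored.
    """
    return [sum(row[: t + 1]) / (t + 1) for t, row in enumerate(acc)]


def accuracy_drop(acc):
    """Accuracy lost on each earlier task between learning it and the end.

    Assumption: the drop is measured against the accuracy recorded right
    after the task was learned (the diagonal of the matrix).
    """
    final = acc[-1]
    return [acc[k][k] - final[k] for k in range(len(acc) - 1)]


# Hypothetical accuracies (percent) for three sequential tasks.
acc = [
    [80.0, 0.0, 0.0],
    [60.0, 70.0, 0.0],
    [50.0, 55.0, 75.0],
]

seen = mean_seen_accuracy(acc)  # one value per training stage
drops = accuracy_drop(acc)      # one value per earlier task
```

A model that retains earlier tasks better shows a flatter `seen` curve and smaller values in `drops`.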