Recent Advances of Multimodal Continual Learning: A Comprehensive Survey

Dianzhi Yu, Xinni Zhang, Yankai Chen, Aiwei Liu, Yifei Zhang, Philip S. Yu, Irwin King

TL;DR

This survey articulates a comprehensive view of multimodal continual learning (MMCL) by organizing methods into four families—regularization-based, architecture-based, replay-based, and prompt-based—while detailing MMCL scenarios, preliminaries, and evaluation metrics. It synthesizes representative architectures, methodologies, datasets, and benchmarks, and highlights open challenges such as modality imbalance and complex inter-modal interactions. The paper also outlines future directions, including improving modality coverage, leveraging parameter-efficient fine-tuning, maintaining pre-trained multimodal knowledge, and promoting trustworthy MMCL. By providing a structured taxonomy and practical benchmarks, it aims to accelerate research toward robust, scalable, and adaptable MMCL systems for real-world, multimodal data streams.

Abstract

Continual learning (CL) aims to empower machine learning models to learn continually from new data, while building upon previously acquired knowledge without forgetting. As machine learning models have evolved from small to large pre-trained architectures, and from supporting unimodal to multimodal data, multimodal continual learning (MMCL) methods have recently emerged. The primary challenge of MMCL is that it goes beyond a simple stacking of unimodal CL methods, as such straightforward approaches often yield unsatisfactory performance. In this work, we present the first comprehensive survey on MMCL. We provide essential background knowledge and MMCL settings, as well as a structured taxonomy of MMCL methods. We categorize existing MMCL methods into four categories, i.e., regularization-based, architecture-based, replay-based, and prompt-based methods, explaining their methodologies and highlighting their key innovations. Additionally, to prompt further research in this field, we summarize open MMCL datasets and benchmarks, and discuss several promising future directions for investigation and development. We have also created a GitHub repository for indexing relevant MMCL papers and open resources available at https://github.com/LucyDYu/Awesome-Multimodal-Continual-Learning.

Paper Structure (29 sections, 5 equations, 6 figures, 4 tables)

This paper contains 29 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Graphical illustrations of CL and MMCL. (a) Unimodal CL. The model continually learns new tasks. While learning a new task, the model tends to forget the previously learned tasks. CL aims to mitigate forgetting. (b) Multimodal CL. In the multimodal setting, the model continually learns new tasks, and the dataset is multimodal. Forgetting in MMCL tends to be more severe due to the challenges discussed elsewhere in the paper. Example tasks in (a) are adapted from SCD Lao2023MultiDomain, VQACL Zhang2023VQACL, CLEVR Johnson2017CLEVR and GQA Hudson2019GQA. Example tasks in (b) are adapted from SCD Lao2023MultiDomain, VQACL Zhang2023VQACL, ODU Sun2021Multimodal and CMR-MFN Wang2023Confusion.
  • Figure 2: MMCL challenges. We use a vision-language model architecture adapted from ViLT Kim2021Vilt as the example to illustrate.
  • Figure 3: Illustrations of CL and MMCL. Notations are defined in the paper's symbol table. (a) Unimodal CL. The model is trained from scratch. (b) Multimodal CL. The model is trained from scratch. (c) Multimodal CL. The model is trained using a pre-trained MM backbone.
  • Figure 4: Illustrations of MMCL scenarios (defined in the paper's section on continual learning scenarios). Notations are defined in the paper's symbol table. (a) Class-incremental Learning (CIL). (b) Domain-incremental Learning (DIL). (c) Task-incremental Learning (TIL). (d) Generative Domain-incremental Learning (GDIL). (e) Modality-dynamic Task-incremental Learning (MDTIL). Panels (a), (b) and (c) are partially adapted and redrawn from Masana2022ClassIncremental and Yang2024Recent, with examples adapted from MTIL Zheng2023Preventing, CIFAR10 Krizhevsky2009Learning and Flowers Nilsback2008Automated. Examples in (d) are adapted from VQAv2 Goyal2017Making, VQACL Zhang2023VQACL and SGP Lei2023Symbolic. Examples in (e) are adapted from CLiMB Srinivasan2022CLiMB, SNLI-VE Xie2019Visual and IMDb Maas2011Learning.
  • Figure 5: Taxonomy of multimodal continual learning (MMCL). We divide MMCL methods into four categories: Regularization-based, Architecture-based, Replay-based and Prompt-based, each covered in its own section of the paper.
  • ...and 1 more figure

Theorems & Definitions (10)

  • Definition 1: Task Sequence
  • Definition 2: Continual Learning (CL)
  • Remark 1
  • Definition 3: Modality-IDs and Set
  • Definition 4: Unimodal and Multimodal
  • Definition 5: Modality-static, Modality-increasing, Modality-decreasing and Modality-switching
  • Definition 6: Subsequence
  • Definition 7: Modality-dynamic
  • Definition 8: Multimodal Continual Learning (MMCL)
  • Remark 2
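
The definition bodies are not reproduced on this page, so the following is only a rough sketch of how the four modality regimes named in Definition 5 might be told apart for two consecutive tasks. It assumes, purely from the names, that each task carries a set of modality IDs and that the regimes correspond to set equality, strict superset, strict subset, and neither. The function name `modality_transition` and the string labels are hypothetical and may not match the paper's formal definitions.

```python
from typing import AbstractSet


def modality_transition(prev: AbstractSet[str], nxt: AbstractSet[str]) -> str:
    """Label the change in modality sets between two consecutive tasks.

    Assumption (not taken from the paper text): the labels follow a plain
    set-comparison reading of the names in Definition 5.
    """
    if prev == nxt:
        return "modality-static"      # same modalities in both tasks
    if prev < nxt:
        return "modality-increasing"  # next task strictly adds modalities
    if prev > nxt:
        return "modality-decreasing"  # next task strictly drops modalities
    return "modality-switching"       # neither set contains the other


# Hypothetical three-task sequence: vision, then vision + language, then audio.
tasks = [{"vision"}, {"vision", "language"}, {"audio"}]
labels = [modality_transition(a, b) for a, b in zip(tasks, tasks[1:])]
```

Under this reading, a sequence would count as modality-dynamic whenever at least one consecutive pair is not modality-static; that too is an interpretation of the name in Definition 7 rather than a statement of the paper's definition.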