When Continue Learning Meets Multimodal Large Language Model: A Survey

Yukang Huo, Hao Tang

TL;DR

This survey addresses continual learning in multimodal large language models (MLLMs) by synthesizing insights from roughly 440 papers. It organizes the literature into foundational MLLM concepts and continual learning in three settings (non-LLM unimodal, non-LLM multimodal, and LLM), and then analyzes state-of-the-art model innovations, learning methods, and benchmarks. Key contributions include a taxonomy of methods along three dimensions (frameworks, objectives, and modules), a comprehensive review of benchmark suites (e.g., ROPE, CVQA, II-Bench, ConBench, COMPBENCH, Hallu-PI, ReForm-Eval, VisionGraph), and the identification of gaps in evaluation standardization and interpretability. The paper also highlights practical applications and offers forward-looking guidance on forgetting mitigation, benchmark standardization, and transparency, with the aim of accelerating robust, scalable deployment of continual learning in multimodal intelligent systems.

Abstract

Recent advancements in Artificial Intelligence have led to the development of Multimodal Large Language Models (MLLMs). However, efficiently adapting these pre-trained models to dynamic data distributions and varied tasks remains a challenge. Fine-tuning MLLMs for specific tasks often degrades performance in the model's prior knowledge domain, a problem known as 'Catastrophic Forgetting'. While this issue has been well studied in the Continual Learning (CL) community, it presents new challenges for MLLMs. This review paper, the first of its kind in MLLM continual learning, presents an overview and analysis of 440 research papers in this area. The review is structured into four sections. First, it discusses the latest research on MLLMs, covering model innovations, benchmarks, and applications in various fields. Second, it categorizes and surveys the latest studies on continual learning, divided into three parts: unimodal continual learning in non-large language models (Non-LLM Unimodal CL), multimodal continual learning in non-large language models (Non-LLM Multimodal CL), and continual learning in large language models (CL in LLM). The third section provides a detailed analysis of the current state of MLLM continual learning research, including benchmark evaluations, architectural innovations, and a summary of theoretical and empirical studies. Finally, the paper discusses the challenges and future directions of continual learning in MLLMs. By connecting the foundational concepts, theoretical insights, method innovations, and practical applications of continual learning for multimodal large models, the review provides a comprehensive picture of research progress and open challenges, aiming to inspire researchers in the field and promote the advancement of related technologies.

Paper Structure

This paper contains 61 sections, 5 figures, and 37 tables.

Figures (5)

  • Figure 1: Timeline of Multimodal Large Model Development.
  • Figure 2: Statistics of the CVQA Benchmark. [mathew2021docvqa]
  • Figure 3: Model performance per Country-Language pair. The blue lines indicate separation by continent. All models show similar behaviour in the majority of cases, despite having different sizes. [mathew2021docvqa]
  • Figure 4: Overview of the 19 detailed evaluation categories in ConBench. [zhang2024unveiling]
  • Figure 5: Assessed capability dimensions and tasks in ReForm-Eval. "Desc" and "Classif" are short for description and classification, respectively. [li2024reform]