Continual-NExT: A Unified Comprehension And Generation Continual Learning Framework

Jingyang Qiao, Zhizhong Zhang, Xin Tan, Jingyu Gong, Yanyun Qu, Yuan Xie

TL;DR

This paper establishes Continual-NExT, a continual learning framework for Dual-to-Dual MLLMs with deliberately architected evaluation metrics, and proposes MAGE (Mixture and Aggregation of General LoRA and Expert LoRA), an efficient method that facilitates knowledge transfer across modalities and mitigates forgetting.

Abstract

Dual-to-Dual MLLMs are Multimodal Large Language Models that unify multimodal comprehension and generation across the text and image modalities. Although they exhibit strong instantaneous learning and generalization capabilities, Dual-to-Dual MLLMs remain deficient in lifelong evolution, which significantly limits their continual adaptation to dynamic real-world scenarios. One challenge is that learning new tasks inevitably destroys previously learned knowledge. Beyond traditional catastrophic forgetting, Dual-to-Dual MLLMs face further challenges, including hallucination, failure to follow instructions, and failures in cross-modal knowledge transfer. However, no standardized continual learning framework for Dual-to-Dual MLLMs has been established yet, leaving these challenges unexplored. In this paper, we therefore establish Continual-NExT, a continual learning framework for Dual-to-Dual MLLMs with deliberately architected evaluation metrics. To improve the continual learning capability of Dual-to-Dual MLLMs, we propose MAGE (Mixture and Aggregation of General LoRA and Expert LoRA), an efficient method that facilitates knowledge transfer across modalities and mitigates forgetting. Extensive experiments demonstrate that MAGE outperforms other continual learning methods and achieves state-of-the-art performance.

Paper Structure

This paper contains 28 sections, 20 equations, 13 figures, 14 tables.

Figures (13)

  • Figure 1: Continual-NExT: A framework for lifelong multimodal learning in Dual-to-Dual MLLMs. The left part shows its multimodal generation capability. The middle part illustrates its development across diverse training stages. The right part presents its supported downstream tasks.
  • Figure 2: Overview of Continual-NExT, including tasks, task types, sizes, and examples.
  • Figure 3: Parameter update patterns across heterogeneous tasks, from a global output view (left) to a local input view (right). VQAv2->ImageNet refers to training first on VQAv2 (Task 1) and then on ImageNet (Task 2). Each pixel value represents the absolute discrepancy between the parameters learned from Task 1 and those learned from Task 2.
  • Figure 4: Results of different LoRA settings.
  • Figure 5: Overview of MAGE. We insert four types of LoRA into the MLLM: Image/Text General LoRA (G-LoRA) and Image/Text Expert LoRA (E-LoRA). We then use a parameter-wise EMA strategy to update the LoRA modules corresponding to the input and output modalities of the current task, while freezing those unrelated to it.
  • ...and 8 more figures
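
The captions of Figures 3 and 5 describe two simple parameter-level operations: an element-wise absolute discrepancy between the weights learned on two tasks, and a parameter-wise EMA update applied only to the LoRA modules tied to the current task. The sketch below illustrates both in plain Python under several assumptions. The adapter names (`image_g_lora`, `text_e_lora`, and so on), the `active` set, the decay `beta`, and the EMA form `beta * old + (1 - beta) * new` are hypothetical choices made for illustration; the captions do not specify how modalities map to modules or which EMA coefficient the paper uses, so this is not the authors' implementation.

```python
from typing import Dict, List, Set

Matrix = List[List[float]]


def abs_discrepancy(w_task1: Matrix, w_task2: Matrix) -> Matrix:
    """Element-wise |W_task2 - W_task1| (the per-pixel quantity in Figure 3)."""
    return [
        [abs(b - a) for a, b in zip(row1, row2)]
        for row1, row2 in zip(w_task1, w_task2)
    ]


def ema_update(old: Matrix, new: Matrix, beta: float) -> Matrix:
    """Parameter-wise EMA, assumed here to be beta * old + (1 - beta) * new."""
    return [
        [beta * o + (1.0 - beta) * n for o, n in zip(row_old, row_new)]
        for row_old, row_new in zip(old, new)
    ]


def update_adapters(
    adapters: Dict[str, Matrix],
    trained: Dict[str, Matrix],
    active: Set[str],
    beta: float = 0.9,
) -> Dict[str, Matrix]:
    """EMA-update the adapters named in `active`; leave all others frozen.

    `active` is a hypothetical argument: the caller decides which LoRA
    modules relate to the current task's input and output modalities.
    """
    return {
        name: ema_update(weights, trained[name], beta) if name in active else weights
        for name, weights in adapters.items()
    }


if __name__ == "__main__":
    # Toy 1x2 "weight matrices" for the four hypothetical adapter slots.
    adapters = {
        "image_g_lora": [[1.0, 1.0]],
        "text_g_lora": [[2.0, 2.0]],
        "image_e_lora": [[0.0, 0.0]],
        "text_e_lora": [[4.0, 4.0]],
    }
    trained = {name: [[3.0, 5.0]] for name in adapters}
    # Assume the current task only involves the two image-side adapters.
    updated = update_adapters(
        adapters, trained, active={"image_g_lora", "image_e_lora"}, beta=0.5
    )
    print(updated)
    print(abs_discrepancy(adapters["image_g_lora"], updated["image_g_lora"]))
```

Passing the active set explicitly keeps the freezing rule visible: any adapter not named in `active` is returned unchanged, which corresponds to the "freezing" half of the Figure 5 caption.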