Merge then Realign: Simple and Effective Modality-Incremental Continual Learning for Multimodal LLMs

Dingkun Zhang, Shuhan Qi, Xinyu Xiao, Kehai Chen, Xuan Wang

TL;DR

This work identifies forgetting and misalignment as dual degradation factors in modality-incremental continual learning for Multimodal LLMs and introduces MERA, a two-stage, merging-then-realigning paradigm. The merging stage uses a weighted average to integrate new modality knowledge into the modality-agnostic LLM, while the realigning stage fine-tunes lightweight modality connectors with a small replay dataset to restore encoder-LLM alignment, all without heavy architectural changes. Empirical results across four modalities show MERA achieves near-lossless MCL performance, notably 99.84% backward relative gain with 10% replay, and outperforms state-of-the-art baselines across training orders, underscoring misalignment as a key issue in MCL. The approach is architecture- and budget-friendly, highlighting practical routes to efficiently extend MLLMs to additional modalities and guiding future exploration of cross-modal interactions during continual learning.
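The merging stage described above can be illustrated with a minimal sketch. It assumes the LLM parameters before and after learning a new modality are available as name-to-values mappings of identical shape, and it takes the averaging weight as a free parameter, since the summary does not state which weights MERA uses. The names `merge_weighted`, `new_weight`, and the toy parameter sets are hypothetical and not taken from the paper.

```python
from typing import Dict, List

# Hypothetical representation: each parameter tensor flattened to a list of floats.
Params = Dict[str, List[float]]


def merge_weighted(old: Params, new: Params, new_weight: float) -> Params:
    """Element-wise weighted average of two parameter sets with matching names and sizes."""
    if not 0.0 <= new_weight <= 1.0:
        raise ValueError("new_weight must lie in [0, 1]")
    if old.keys() != new.keys():
        raise ValueError("parameter names differ")
    merged: Params = {}
    for name, old_values in old.items():
        new_values = new[name]
        if len(old_values) != len(new_values):
            raise ValueError(f"size mismatch for {name}")
        # (1 - w) * old + w * new, applied to every element.
        merged[name] = [
            (1.0 - new_weight) * o + new_weight * n
            for o, n in zip(old_values, new_values)
        ]
    return merged


# Toy modality-agnostic "LLM" before and after training on a new modality.
llm_before = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
llm_after = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}

# Assumed equal weighting, for illustration only.
merged = merge_weighted(llm_before, llm_after, new_weight=0.5)
```

In this sketch only the shared LLM parameters are averaged. The realigning stage, which fine-tunes the modality connectors on replay data, would then run against the merged weights and is not shown here.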

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enhanced their versatility as they integrate a growing number of modalities. Given the heavy cost of training MLLMs, it is efficient to reuse existing models and extend them to more modalities through Modality-incremental Continual Learning (MCL). The exploration of MCL is still in its early stages. In this work, we investigate the causes of performance degradation in MCL. We find that it suffers not only from forgetting, as in traditional continual learning, but also from misalignment between the modality-agnostic and modality-specific components. To address both forgetting and misalignment, we propose a simple MCL paradigm called "MErge then ReAlign" (MERA). MERA avoids introducing heavy model budgets or modifying model architectures, so it is easy to deploy and highly reusable in the MLLM community. Extensive experiments show that MERA holds an average of 99.84% Backward Relative Gain when extending to four modalities, achieving nearly lossless MCL performance. Our findings underscore the misalignment issue in MCL. More broadly, our work showcases how to adjust different components of MLLMs during continual learning.

Paper Structure

This paper contains 36 sections, 5 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Illustration of misalignment and the mechanism of our proposed realigning. $\phi_i$ is the feature distribution of the $i$-th modality. Regions in yellow represent the LLM's expected distribution of the connector's output. (a) and (b) are the states of the last learning stage and the ideal MCL after learning on a new modality. (c) shows the actual misalignment after learning on a new modality. (d) demonstrates the mechanism of our proposed realigning.
  • Figure 2: Pipeline of the proposed MERA. The procedures in gray boxes involve training. The snowflake and flame icons represent the frozen and trainable modules, respectively.
  • Figure 3: Progressive Backward Relative Gain in modality-incremental continual learning. For each stage $i$, we plot the corresponding Backward Relative Gain averaged over two different training orders. We set Backward Relative Gain to 100% for the first stage, denoting the initial performance without degradation. The exception is EProj, whose initial Backward Relative Gain is not 100% because it tunes only the modality-specific components, which causes an initial performance degradation.
  • Figure 4: Progressive Forward Relative Gain in modality-incremental continual learning. For each stage $i$, we plot the corresponding Forward Relative Gain averaged over two different training orders. We set Forward Relative Gain to 100% for the first stage, denoting the initial lossless plasticity. The exception is EProj, whose initial Forward Relative Gain is not 100% because it tunes only the modality-specific components, which causes an initial loss of plasticity.