Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models

Xiwen Wei, Mustafa Munir, Radu Marculescu

TL;DR

The paper tackles forgetting in unified multimodal generative models during continual instruction tuning by introducing Modality-Decoupled Experts (MoDE), which separates text and image updates via a Text MoE and a Visual LoRA, while freezing the pre-trained backbone. It furnishes a gradient-conflict-based theoretical justification showing that modality decoupling eliminates first-order interference, yielding a second-order drift in image generation that scales as O(η^2). Empirically, MoDE outperforms strong baselines across multiple backbones and benchmarks, preserving image-generation quality and improving multimodal understanding under continual learning. The work demonstrates that decoupled modality updates plus knowledge distillation provide a robust, scalable path for continual learning in autoregressive UMGMs. This advances practical deployment of unified models that can continuously acquire new multimodal capabilities without erasing prior cross-modal alignment.

Abstract

Unified Multimodal Generative Models (UMGMs) unify visual understanding and image generation within a single autoregressive framework. However, their ability to continually learn new tasks is severely hindered by catastrophic forgetting, both within a modality (intra-modal) and across modalities (inter-modal). While intra-modal forgetting has been studied in prior continual learning (CL) work, inter-modal forgetting remains largely unexplored. In this paper, we identify and empirically validate this phenomenon in UMGMs and provide a theoretical explanation rooted in gradient conflict between modalities. To address both intra- and inter-modal forgetting, we propose Modality-Decoupled Experts (MoDE), a lightweight and scalable architecture that isolates modality-specific updates to mitigate the gradient conflict and leverages knowledge distillation to prevent catastrophic forgetting and preserve pre-trained capabilities. Unlike previous CL methods that remain modality-coupled and suffer from modality gradient conflict, MoDE explicitly decouples modalities to prevent interference. Experiments across diverse benchmarks demonstrate that MoDE significantly mitigates both inter- and intra-modal forgetting, outperforming prior CL baselines in unified multimodal generation settings. Code will be publicly available at https://github.com/Christina200/MoDE-official.git
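
The distillation term mentioned in the abstract can be illustrated with a minimal sketch. It assumes a token-level KL divergence between the frozen teacher's and the student's next-token distributions. The abstract does not specify the exact loss, so `kd_loss`, the logit layout, and the toy values below are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kd_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over token positions (assumed loss form)."""
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)

# Toy logits: two image-token positions, vocabulary of three codes.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
drifted = [[0.0, 2.0, -1.0], [3.0, 0.2, 0.1]]

same_loss = kd_loss(teacher, teacher)   # student identical to teacher
drift_loss = kd_loss(teacher, drifted)  # student has drifted away
print(same_loss, drift_loss)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the student drifts, which is the property a distillation penalty relies on to preserve pre-trained behavior.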

Paper Structure

This paper contains 36 sections, 2 theorems, 18 equations, 6 figures, 13 tables.

Key Result

Proposition 1

When fine-tuning on the text-generation tasks, a stochastic gradient descent (SGD) update with sufficiently small step size $\eta$ modifies the model parameters as $\theta \leftarrow \theta - \eta g_t$. The resulting change in the visual loss is: $\Delta \mathcal{L}_v = -\eta \langle g_t, g_v\rangle + \frac{\eta^2}{2} g_t^\top H_v g_t + O(\eta^3)$, where $H_v = \nabla^2_\theta \mathcal{L}_v$ is the Hessian of the visual loss. Hence, when $\langle g_t, g_v\rangle < 0$, a step optimizing text generat
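
The second-order expansion behind Proposition 1 can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's experiment: the visual loss is a hypothetical quadratic (so the expansion is exact), and `A`, `b`, `theta`, and both text gradients are made-up toy values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

# Hypothetical quadratic visual loss: L_v(theta) = 0.5 theta^T A theta + b^T theta.
A = [[2.0, 0.5], [0.5, 1.0]]  # plays the role of the Hessian H_v
b = [-1.0, 0.3]

def visual_loss(theta):
    return 0.5 * dot(theta, matvec(A, theta)) + dot(b, theta)

theta = [0.4, -0.2]
g_v = [x + y for x, y in zip(matvec(A, theta), b)]  # gradient of L_v at theta

def actual_change(g_t, eta):
    """L_v(theta - eta * g_t) - L_v(theta), evaluated directly."""
    stepped = [p - eta * g for p, g in zip(theta, g_t)]
    return visual_loss(stepped) - visual_loss(theta)

def predicted_change(g_t, eta):
    """-eta <g_t, g_v> + (eta^2 / 2) g_t^T H_v g_t."""
    first = -eta * dot(g_t, g_v)
    second = 0.5 * eta ** 2 * dot(g_t, matvec(A, g_t))
    return first + second

eta = 0.01
g_conflict = [1.0, -0.5]  # <g_t, g_v> < 0: the text step raises the visual loss
g_orthogonal = [1.0, 1.0] # <g_t, g_v> = 0: only the eta^2 term remains

for g_t in (g_conflict, g_orthogonal):
    print(dot(g_t, g_v), actual_change(g_t, eta), predicted_change(g_t, eta))
```

With the conflicting gradient the first-order term dominates and the visual loss increases. With the orthogonal gradient the first-order term vanishes and the change shrinks to the order of $\eta^2$, which is the regime the TL;DR attributes to decoupled updates.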

Figures (6)

  • Figure 1: (a) illustrates catastrophic forgetting during naive sequential instruction tuning. Continual instruction tuning starts with the Chameleon model on tasks ScienceQA$\rightarrow$TextVQA$\rightarrow$ImageNet$\rightarrow$GQA$\rightarrow$VizWiz. "Start" refers to the performance immediately after the model has been tuned on a task, while "End" denotes performance after completing all tasks. A larger gap between the two bars indicates more severe forgetting. (b) visualizes forgetting in both multimodal generation tasks (in red, representing inter-modal forgetting) and multimodal understanding tasks (in green, representing intra-modal forgetting). Our proposed MoDE mitigates both types of forgetting, preserving performance across modalities.
  • Figure 2: Inter-modal catastrophic forgetting during continual instruction tuning across three VQA tasks. The pre-trained UMGM serves as the upper bound for image generation quality. CLIP scores under each sample reflect text-image alignment in CLIP feature space. Red bounding boxes highlight regions with degraded image quality and low CLIP scores, indicating increasing misalignment between prompts and generated images. For instance, the image generated for the prompt "A photo of a car" depicts a building instead of a car (VQA task #3).
  • Figure 3: An autoregressive UMGM with our proposed MoDE integrated into its linear layers (MLPs). V-Adapter (Visual LoRA, in the light blue box): LoRA specialized for both the generation and understanding of image tokens. T-MoE Adapters (Text Mixture-of-Experts LoRA, in the light brown box): MoE-LoRA designed for text tokens, supporting continual learning of multimodal understanding tasks. T-router computes the routing weights $g_j(x)$ that determine how much each expert LoRA contributes for a given text token. The circled "+" symbol denotes addition. During continual instruction tuning, the T-MoE primarily updates for text answers, while the V-Adapter handles image tokens. To preserve the model's image generation capability and mitigate inter-modal forgetting, we apply a knowledge distillation loss from the original (teacher) UMGM to the new (student) model’s V-Adapter.
  • Figure 4: Qualitative results of image generation on the Chameleon model. Our method generates more visually coherent and faithful images compared to other baselines (e.g., the realistic dog in the first row, steam in the second row). Additional examples are provided in the appendix.
  • Figure 5: Cosine distance distribution between text and image modality gradients in modality-coupled MoE LoRA on the ScienceQA dataset. The y-axis shows the proportion of parameters corresponding to each cosine distance. Lower cosine distance values indicate greater gradient conflict, with 0 denoting orthogonal update directions.
  • ...and 1 more figure
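
The layer described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: the hard switch on a per-token `is_image` flag, the helper names (`mode_linear`, `lora`), the rank-1 adapters, and all weights are hypothetical, and the frozen base weight is a toy identity matrix.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def lora(down, up, x):
    """Low-rank update: up @ (down @ x)."""
    return matvec(up, matvec(down, x))

def mode_linear(x, is_image, W, v_adapter, t_experts, router):
    """Frozen linear layer plus a modality-specific adapter (assumed routing)."""
    base = matvec(W, x)  # frozen pre-trained weight, never updated
    if is_image:
        # Image tokens only pass through the V-Adapter (Visual LoRA).
        return add(base, lora(v_adapter[0], v_adapter[1], x))
    # Text tokens: the T-router yields weights g_j(x) over the expert LoRAs.
    gates = softmax(matvec(router, x))
    delta = [0.0] * len(base)
    for g, (down, up) in zip(gates, t_experts):
        delta = add(delta, [g * d for d in lora(down, up, x)])
    return add(base, delta)

W = [[1.0, 0.0], [0.0, 1.0]]
v_adapter = ([[1.0, 0.0]], [[0.5], [0.0]])
t_experts = [([[0.0, 1.0]], [[0.0], [1.0]]),
             ([[0.0, 1.0]], [[0.0], [3.0]])]
router = [[0.0, 0.0], [0.0, 0.0]]  # uniform routing for the toy example

x = [1.0, 2.0]
image_out = mode_linear(x, True, W, v_adapter, t_experts, router)
text_out = mode_linear(x, False, W, v_adapter, t_experts, router)
print(image_out, text_out)
```

Because image tokens never touch the text experts in this sketch, changing `t_experts` leaves `image_out` unchanged, which is the decoupling property the caption describes.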

Theorems & Definitions (3)

  • Definition 1: Modality Gradient Conflict
  • Proposition 1
  • Proposition 2
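
A rough way to probe modality gradient conflict, in the spirit of Definition 1 and the statistic plotted in Figure 5, is to compare text and image gradients block by block. The sketch below assumes conflict means a negative inner product between the two gradients, as in the condition of Proposition 1. The helper names and the toy gradients are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def conflict_fraction(text_grads, image_grads):
    """Return (share of blocks with negative cosine, per-block cosines)."""
    cosines = [cosine(gt, gv) for gt, gv in zip(text_grads, image_grads)]
    share = sum(1 for c in cosines if c < 0) / len(cosines)
    return share, cosines

# Toy per-block gradients: opposed, aligned, and orthogonal directions.
text_grads = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
image_grads = [[-1.0, 0.0], [1.0, 1.0], [3.0, 0.0]]

share, cosines = conflict_fraction(text_grads, image_grads)
print(share, cosines)
```

Only the first block counts as conflicting here. The orthogonal block has cosine 0 and contributes no first-order interference.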