Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent

Junda Wu, Yuxin Xiong, Xintong Li, Yu Xia, Ruoyu Wang, Yu Wang, Tong Yu, Sungchul Kim, Ryan A. Rossi, Lina Yao, Jingbo Shang, Julian McAuley

TL;DR

The paper addresses visual forgetting in multimodal LLMs during instruction-tuning by introducing an information-theoretic, effective-rank perspective to quantify visual knowledge degradation. It proposes Modality-Decoupled Gradient Descent (MDGD), which orthogonally separates visual representation learning from task alignment and regularizes gradient updates to preserve the pre-trained visual knowledge. A gradient-masking variant (MDGD-GM) enables parameter-efficient fine-tuning, reducing computational load while maintaining performance. Extensive experiments on LLaVA-1.5 and MiniCPM backbones show that MDGD substantially mitigates visual forgetting and yields strong downstream adaptation, outperforming baselines like LoRA and Model Tailor. The work offers a practical, scalable solution for robust multimodal instruction-tuning with preserved visual understanding.

Abstract

Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets. Unlike pre-training, where MLLMs receive rich visual-text alignment, instruction-tuning is often text-driven with weaker visual supervision, leading to the degradation of pre-trained visual understanding and causing visual forgetting. Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this issue, often compressing visual representations and prioritizing task alignment over visual retention, which further worsens visual forgetting. To overcome this limitation, we introduce a novel perspective leveraging effective rank to quantify the degradation of visual representation richness, interpreting this degradation through the information bottleneck principle as excessive compression that leads to the degradation of crucial pre-trained visual knowledge. Building on this view, we propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations while mitigating the over-compression effects described by the information bottleneck. By explicitly disentangling the optimization of visual understanding from task-specific alignment, MDGD preserves pre-trained visual knowledge while enabling efficient task adaptation. To enable lightweight instruction-tuning, we further develop a memory-efficient fine-tuning approach using gradient masking, which selectively updates a subset of model parameters to enable parameter-efficient fine-tuning (PEFT), reducing computational overhead while preserving rich visual representations. Extensive experiments across various downstream tasks and backbone MLLMs demonstrate that MDGD effectively mitigates visual forgetting from pre-trained tasks while enabling strong adaptation to new tasks.
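
To make the effective-rank idea concrete, the sketch below computes one common entropy-based form of effective rank for a matrix of token features. This section does not spell out the exact definition the paper uses, so the formula (exponential of the Shannon entropy of the normalized singular values), the function name `effective_rank`, and the `eps` threshold are assumptions for illustration only.

```python
import numpy as np

def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a (tokens x dim) feature matrix.

    Assumed definition (not confirmed by this section of the paper):
    exp(H(p)), where p is the singular-value spectrum normalized to sum to 1.
    """
    singular_values = np.linalg.svd(features, compute_uv=False)
    total = singular_values.sum()
    if total <= eps:
        return 0.0  # an all-zero matrix carries no directions at all
    p = singular_values / total
    p = p[p > eps]  # drop numerically zero entries so log() stays finite
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))

# A spectrum spread evenly over 4 directions versus one collapsed to 1.
rich = np.eye(4)
collapsed = np.outer(np.ones(5), np.ones(3))
rich_rank = effective_rank(rich)
collapsed_rank = effective_rank(collapsed)
```

Under this definition, a representation whose singular values are spread evenly scores close to its full dimension, while one that has collapsed onto a single direction scores close to 1, which is the sense in which a falling effective rank signals over-compression.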

Paper Structure

This paper contains 24 sections, 15 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: The top-10 image tokens with the highest effective ranks on OKVQA and POPE encoded by LLaVA, and on PathVQA and POPE encoded by MiniCPM. We compare pretrained, fine-tuned, and MDGD-fine-tuned models. Effective rank (wei2024diff) quantifies representation richness; here it is applied to analyze visual degradation in instruction-tuned MLLMs. Results show that MDGD preserves a higher effective rank, mitigating visual forgetting.
  • Figure 2: Illustration of the proposed method. To mitigate suboptimal optimization and prevent visual forgetting, we first project $\nabla_{\theta} \mathcal{L}_{vl}$ onto the direction orthogonal to $\nabla_{\theta} \mathcal{L}_{v}$, obtaining $\bar{g}_{\theta}$. Next, we project $\bar{g}_{\theta}$ onto the direction of $\bar{g}_\phi$, yielding $\tilde{g}_\theta$. This process guides the gradient towards the optimal region without visual forgetting.
  • Figure 3: t-SNE plots of the distribution of extracted visual $\pi(X^v)$ and multimodal $z^{vl}$ representations from pre-trained LLaVA-1.5, and models with direct fine-tuning and MDGD on OKVQA and Flickr30K.
  • Figure 4: t-SNE plots of the distribution of extracted visual $\pi(X^v)$ and multimodal $z^{vl}$ representations from pre-trained MiniCPM, and models with direct fine-tuning and MDGD on PathVQA and TextCaps.
  • Figure 5: Illustration of (a) the learning process of three methods based on task loss $\mathcal{L}_{vl}(\phi,\theta)$, (b) the average regularized cosine similarity $\frac{\bar{g}_\theta^\top \bar{g}_\phi}{\|\bar{g}_\phi\| \|\bar{g}_\theta\|}$ in Eq. \ref{eq:masking} for gradient masking at varying ratios $\alpha$, and (c) the visual representation loss $\mathcal{L}_v(\phi,\theta)$ in Eq. \ref{eq:visual_loss} for gradient masking at varying ratios $\alpha$.
  • ...and 1 more figures
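
The two projection steps described in the Figure 2 caption, together with the cosine similarity shown in Figure 5(b), can be sketched with plain vectors. This is a minimal illustration under stated assumptions, not the paper's implementation: the gradients are flattened to toy 3-dimensional vectors of equal size, and the helper names `orthogonal_component`, `project_onto`, and `cosine_similarity` are hypothetical. How the paper aligns gradients over $\theta$ and $\phi$ in practice is not specified in this section.

```python
import numpy as np

def orthogonal_component(g: np.ndarray, v: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Return the part of g that is orthogonal to v (g minus its projection on v)."""
    denom = float(v @ v)
    if denom <= eps:
        return g.copy()  # v is (numerically) zero, so nothing to remove
    return g - (g @ v) / denom * v

def project_onto(g: np.ndarray, u: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Return the projection of g onto the direction of u."""
    denom = float(u @ u)
    if denom <= eps:
        return np.zeros_like(g)
    return (g @ u) / denom * u

def cosine_similarity(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> float:
    """Cosine of the angle between a and b, as in the Figure 5(b) ratio."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Toy stand-ins (assumed values) for the gradients named in the caption.
grad_vl = np.array([1.0, 2.0, 0.5])    # stands in for the gradient of L_vl w.r.t. theta
grad_v = np.array([0.0, 1.0, 0.0])     # stands in for the gradient of L_v w.r.t. theta
g_phi_bar = np.array([1.0, 0.0, 1.0])  # stands in for \bar{g}_phi

# Step 1: remove the component of the task gradient that lies along grad_v.
g_theta_bar = orthogonal_component(grad_vl, grad_v)
# Step 2: project the result onto the direction of \bar{g}_phi.
g_theta_tilde = project_onto(g_theta_bar, g_phi_bar)

alignment = cosine_similarity(g_theta_bar, g_phi_bar)
```

After step 1 the vector has no component along `grad_v`, and after step 2 it is parallel to `g_phi_bar`; the `alignment` value is the quantity one would average to reproduce a curve like the one in Figure 5(b).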