Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models

Hyeontaek Hwang, Nguyen Dinh Son, Daeyoung Kim

TL;DR

This work tackles catastrophic forgetting during task-specific fine-tuning of Multimodal LLMs by introducing Model-Dowser, a data-free sparse fine-tuning framework. It defines a principled functional importance score $S^{(l)}_{ij} = \|J^{(l)}_i\|_2 \cdot |W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|$ and estimates it without real data using synthetic probing and the Hutchinson trace estimator, yielding a robust, low-overhead mask that freezes high-sensitivity parameters. The method updates only the least important parameters, achieving strong memory efficiency ($\mathcal{O}(|P|)$) and superior forgetting mitigation across LLaVA and NVILA on diverse downstream tasks, even when fine-tuning deep decoder layers. The results demonstrate stable preservation of pretrained generalization while enabling task-specific adaptation, with consistent improvements over strong baselines and clear scalability to multi-billion-parameter models.
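The score above can be illustrated on a toy network. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a two-layer stand-in $f(h) = V \tanh(W h)$, a random vector as the synthetic probe input, and a Hutchinson-style estimate of the column norms that relies on the identity $\mathbb{E}[(J^\top v)_i^2] = \|J_i\|_2^2$ for Rademacher vectors $v$. The names `hutchinson_column_norms`, `vjp`, `V`, and `gate` are hypothetical; in a real model the product $J^\top v$ would come from one backward pass rather than a closed form.

```python
import numpy as np

def hutchinson_column_norms(vjp, d_out, num_probes, rng):
    """Estimate ||J_i||_2 for every column i of J using only J^T v products."""
    acc = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=d_out)  # Rademacher probe vector
        acc = acc + vjp(v) ** 2                  # E[(J^T v)_i^2] = ||J_i||_2^2
    return np.sqrt(acc / num_probes)

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 6, 5, 8
W = rng.normal(size=(d_hid, d_in))   # layer weights W^{(l)}
V = rng.normal(size=(d_out, d_hid))  # toy stand-in for the rest of the network
h = rng.normal(size=d_in)            # synthetic input activation h^{(l-1)}

z = W @ h                            # pre-activation z^{(l)}
gate = 1.0 - np.tanh(z) ** 2         # derivative of tanh evaluated at z

def vjp(v):
    # For f = V tanh(z), J = V diag(gate), so J^T v = gate * (V^T v).
    return gate * (V.T @ v)

col_norms = hutchinson_column_norms(vjp, d_out, num_probes=4000, rng=rng)
exact_norms = np.linalg.norm(V * gate, axis=0)  # exact column norms of J

# S_ij = ||J_i||_2 * |W_ij| * |h_j|, one score per weight entry.
scores = col_norms[:, None] * np.abs(W) * np.abs(h)[None, :]
```

Because each probe needs only a vector-Jacobian product, the full Jacobian is never materialized, which is what makes this kind of estimate attractive for large models.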

Abstract

Fine-tuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the rest. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.

Paper Structure

This paper contains 44 sections, 5 theorems, 27 equations, 8 figures, 9 tables.

Key Result

Theorem 3.1

Consider a layer $l$ in an MLLM $f$. Under a first-order Taylor approximation, the L2 norm of the output shift $\Delta f$ when perturbing a weight $W^{(l)}_{ij}$ by $\Delta W^{(l)}_{ij}$ is given by $\|\Delta f\|_2 \approx \|J^{(l)}_i\|_2 \cdot |h^{(l-1)}_j| \cdot |\Delta W^{(l)}_{ij}|$, where $J^{(l)}_i = \partial f / \partial z^{(l)}_i$ denotes the $i$-th column of the Jacobian matrix of the network output with respect to the pre-activation vector $z^{(l)}$, and $h^{(l-1)}$ is the input activation of the ...
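A short derivation sketch may help show where a first-order expression of this kind comes from. It assumes the layer computes $z^{(l)} = W^{(l)} h^{(l-1)}$ (a bias term would not change the argument) and writes $\Delta W^{(l)}_{ij}$ for the perturbation; both are notation introduced here for illustration and may differ from the paper's own statement.

```latex
\begin{align*}
z^{(l)}_i = \sum_j W^{(l)}_{ij}\, h^{(l-1)}_j
  \quad&\Rightarrow\quad
  \frac{\partial f}{\partial W^{(l)}_{ij}}
    = \frac{\partial f}{\partial z^{(l)}_i}\, h^{(l-1)}_j
    = J^{(l)}_i\, h^{(l-1)}_j, \\
\Delta f \approx J^{(l)}_i\, h^{(l-1)}_j\, \Delta W^{(l)}_{ij}
  \quad&\Rightarrow\quad
  \|\Delta f\|_2 \approx \|J^{(l)}_i\|_2 \cdot |h^{(l-1)}_j| \cdot |\Delta W^{(l)}_{ij}| .
\end{align*}
```

If the perturbation is taken proportional to the weight's own magnitude, $|\Delta W^{(l)}_{ij}| \propto |W^{(l)}_{ij}|$, the right-hand side matches the importance score $S^{(l)}_{ij} = \|J^{(l)}_i\|_2 \cdot |W^{(l)}_{ij}| \cdot |h^{(l-1)}_j|$ from the TL;DR up to a constant factor.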

Figures (8)

  • Figure 1: Performance comparison of catastrophic forgetting mitigation methods on LLaVA-1.5 (7B) and NVILA-Lite (2B) fine-tuned on ImageNet-R. (a)-(b) Radar charts illustrating the balance between downstream adaptation and upstream capabilities. (c)-(d) H-score stability across varying fine-tuning depths. Model-Dowser (red line) consistently achieves robust performance compared to previous works.
  • Figure 2: Overall architecture of Model-Dowser. The proposed method consists of three main steps. 1. Probing (\ref{subsec:estimation}): samples the Jacobian matrix and input activations at every layer ($l$) using synthetic data. 2. Compute Score (\ref{subsec:estimation}): generates a parameter-wise importance score from the Jacobian matrix, weight magnitude, and activation. 3. Sparse Finetune (\ref{subsec:sparse_finetuning}): updates the least important $\rho\%$ of parameters (highlighted in yellow), ranked by importance score, for the target downstream task.
  • Figure 3: Performance comparison across fine-tuning depths on COCO and ImageNet-R. Results show the average accuracy across all tasks for an update ratio of $\rho=0.1$ and various merging methods. The x-axis denotes the number of layers fine-tuned, counted incrementally from the final output layer toward the initial input layer.
  • Figure 4: Performance comparison across various mask ratios ($\rho$) on COCO-Caption and ImageNet-R using (a-b) NVILA-Lite-2B and (c-d) LLaVA-1.5-7B. Results show the upstream and downstream performance ($A_{up}$, Avg, H-score).
  • Figure 5: Radar charts across diverse benchmarks for LLaVA-1.5-7B and NVILA-Lite when fine-tuning all layers.
  • ...and 3 more figures
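
The third step in the Figure 2 caption, updating only the least important fraction $\rho$ of parameters, can be sketched as a boolean mask applied to the gradient. This is an illustrative NumPy example under stated assumptions, not the paper's code: the scores and gradient are random placeholders, plain SGD stands in for whatever optimizer is actually used, and the helper names `least_important_mask` and `masked_sgd_step` are hypothetical.

```python
import numpy as np

def least_important_mask(scores, rho):
    """Boolean mask that is True for the fraction rho of entries with the lowest score."""
    k = int(round(rho * scores.size))
    mask = np.zeros(scores.size, dtype=bool)
    if k > 0:
        # argpartition places the k smallest scores in the first k positions.
        idx = np.argpartition(scores.ravel(), k - 1)[:k]
        mask[idx] = True
    return mask.reshape(scores.shape)

def masked_sgd_step(W, grad, mask, lr):
    """Apply a gradient step only where mask is True; all other entries stay frozen."""
    return W - lr * grad * mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 5))
scores = rng.random(size=W.shape)   # placeholder importance scores
grad = rng.normal(size=W.shape)     # placeholder downstream-task gradient

mask = least_important_mask(scores, rho=0.1)
W_new = masked_sgd_step(W, grad, mask, lr=0.01)
```

With 20 weights and $\rho = 0.1$, two entries are trainable and the other 18 keep their pretrained values exactly, which is the property the method relies on to preserve upstream behavior.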

Theorems & Definitions (8)

  • Theorem 3.1: Functional shift for single-weight perturbation
  • Corollary 3.2: Functional shift for multi-weight perturbation
  • Theorem 2.1: \ref{theorem:score}
  • proof
  • Corollary 3.1: \ref{cor:multi}
  • proof
  • Theorem 4.1
  • proof