Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

Didi Zhu, Zhongyi Sun, Zexi Li, Tao Shen, Ke Yan, Shouhong Ding, Kun Kuang, Chao Wu

TL;DR

Catastrophic forgetting in multi-modal LLMs after fine-tuning undermines original task performance. Model Tailor presents a post-training, patch-based fusion that preserves pre-trained knowledge while selectively enhancing new-task capabilities via a sparse model patch and Hessian-informed decoration, guided by the Lottery Ticket Hypothesis and Optimal Brain Surgeon. Across InstructBLIP and LLaVA-1.5, it yields substantial improvements in retaining original-task performance (high H-scores) and adapting to new tasks, including effective multi-task fusion and synergy with LoRA. The approach is computationally practical through layer-wise optimization and SparseGPT-based Hessian handling, offering a robust, reusable tool for deploying MLLMs in evolving downstream settings.

Abstract

Catastrophic forgetting emerges as a critical challenge when fine-tuning multi-modal large language models (MLLMs), where improving performance on unseen tasks often leads to a significant performance drop on the original tasks. This paper presents a comprehensive analysis of catastrophic forgetting in MLLMs and introduces a post-training adjustment method called Model Tailor. Our method primarily preserves the pre-trained parameters while replacing a small number ($\leq$ 10\%) of fine-tuned parameters, maintaining $\sim$ 99\% effectiveness on original tasks versus pre-training, and achieving $\sim$ 97\% on new tasks compared to standard fine-tuning. Specifically, we derive a sparse mask to identify the "model patch", based on a fusion strategy that integrates salience and sensitivity analysis. Subsequently, a compensation mechanism is introduced to "decorate the patch", enhancing the model's performance on both target and original tasks. Additionally, our method is adaptable to multi-task scenarios. Through extensive experiments on InstructBLIP and LLaVA-1.5 in both image captioning and visual question answering tasks, our approach demonstrates significant task adaptability while preserving inherent pre-trained capabilities.
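The patching idea in the abstract (keep the pre-trained weights and swap in at most about 10% of the fine-tuned ones) can be illustrated with a small NumPy sketch. This is only a rough illustration under stated assumptions, not the authors' implementation: the function name `tailor_patch` is hypothetical, the selection score is plain update magnitude standing in for the paper's fused salience and sensitivity score, and the compensation ("decoration") step is omitted entirely.

```python
import numpy as np

def tailor_patch(theta_pre, theta_sft, keep_ratio=0.10):
    """Keep a `keep_ratio` fraction of fine-tuned entries; revert the rest.

    Hypothetical helper: the score below is |theta_sft - theta_pre|, an
    assumed stand-in for the paper's salience/sensitivity fusion.
    """
    delta = theta_sft - theta_pre
    score = np.abs(delta).ravel()
    k = max(1, int(round(keep_ratio * score.size)))

    # Indices of the k largest scores (order within the top-k is irrelevant).
    top = np.argpartition(score, -k)[-k:]
    mask = np.zeros(score.shape, dtype=bool)
    mask[top] = True
    mask = mask.reshape(delta.shape)

    # Patched weights: fine-tuned values inside the mask, pre-trained elsewhere.
    fused = np.where(mask, theta_sft, theta_pre)
    return fused, mask

# Toy "layer": 8 x 16 weights with a small synthetic fine-tuning update.
rng = np.random.default_rng(0)
theta_pre = rng.normal(size=(8, 16))
theta_sft = theta_pre + rng.normal(scale=0.1, size=(8, 16))

fused, mask = tailor_patch(theta_pre, theta_sft, keep_ratio=0.10)
print("fraction of fine-tuned entries kept:", mask.mean())
```

Applying the mask per layer, as here, mirrors the layer-wise treatment mentioned in the TL;DR; a real model would repeat this for every weight matrix.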

Paper Structure

This paper contains 26 sections, 4 theorems, 32 equations, 6 figures, and 6 tables.

Key Result

Theorem 4.3

Consider a layer $\ell$ within an MLLM $\mathcal{M}$, and let $\theta_m$ represent a parameter at index $m$. Altering $\theta_m$ from its fine-tuned state $\theta^m_{\text{sft}}$ to its pre-trained state $\theta^m_{\text{pre}}$ induces an increase $\Delta \mathcal{L}_\mathcal{T}$ in the model's loss where $\mathbf{H}^{-1}$ denotes the inverse of the Hessian matrix and $\left[\mathbf{H}^{-1}\right]
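For background on the role of $\mathbf{H}^{-1}$, the classical Optimal Brain Surgeon result can be checked numerically. The sketch below is not a reconstruction of the theorem statement; it only illustrates the standard fact that, for a purely quadratic loss around a minimum, forcing one coordinate to change by $d$ while the remaining coordinates compensate optimally raises the loss by $d^2 / (2[\mathbf{H}^{-1}]_{mm})$. The toy Hessian, the index `m`, and the value of `d` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2                      # toy dimension and the coordinate being altered

# Assumed toy Hessian: symmetric positive definite by construction.
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)
H_inv = np.linalg.inv(H)

d = -0.3                         # assumed forced change on coordinate m

# Optimal compensating update under the constraint delta[m] == d
# (Lagrangian solution of: minimize 0.5 * delta^T H delta).
delta = d * H_inv[:, m] / H_inv[m, m]

loss_increase = 0.5 * delta @ H @ delta      # quadratic-model loss change
closed_form = d**2 / (2.0 * H_inv[m, m])     # classical OBS expression
naive = 0.5 * d**2 * H[m, m]                 # same change with no compensation

print(loss_increase, closed_form, naive)
```

The uncompensated change is never cheaper than the compensated one, which is the motivation for adjusting the remaining weights after selected parameters are altered.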

Figures (6)

  • Figure 1: Catastrophic Forgetting in Multi-modal Large Language Models. After fine-tuning on two distinct tasks (in orange), InstructBLIP and LLaVA-1.5 exhibit a significant performance decline on their original tasks (in blue). Our method offers a remedy to this issue, mitigating the adverse effects of catastrophic forgetting.
  • Figure 2: Overall Framework of Model Tailor. Model Tailor consists of two primary steps. The first step in this process focuses on the identification of a "model patch", which is defined as a critical subset of fine-tuned parameters deemed essential for improving the model's effectiveness on a given target task. The second step is dedicated to applying compensatory adjustments, a methodological intervention designed to counterbalance any potential performance deficits that may arise from the exclusion of certain parameters during the fine-tuning phase.
  • Figure 3: Model Tailor on Multi-Task Scenario. Models exhibit "performance oscillations", that is, dips in efficacy on one task after fine-tuning on another, which are effectively bridged by Model Tailor's multi-task fusion.
  • Figure 4: Combination with LoRA on LLaVA. The application of Model Tailor to LLaVA-1.5 fine-tuned using LoRA yields significant performance improvements across various datasets, indicating that Model Tailor’s post-training refinement complements LoRA.
  • Figure 5: Comparison of the performance of Model Tailor in various settings. (a-b) Results of InstructBLIP and LLaVA-1.5 at varying sparsity levels. (c-d) Results of InstructBLIP and LLaVA-1.5 with different mask proportions.
  • ...and 1 more figure

Theorems & Definitions (7)

  • Definition 4.1: Model Patch
  • Definition 4.2: Layer Patch
  • Theorem 4.3
  • Definition 4.4: Patch Decorator
  • Theorem 4.5
  • Theorem 3.1
  • Theorem 3.2