Fine-tuning MLLMs Without Forgetting Is Easier Than You Think

He Li, Yuhui Zhang, Xiaohan Wang, Kaifeng Lyu, Serena Yeung-Levy

Abstract

This paper demonstrates that simple adjustments to the fine-tuning recipes of multimodal large language models (MLLMs) are sufficient to mitigate catastrophic forgetting. On visual question answering, we design a 2x2 experimental framework to assess model performance across in-distribution and out-of-distribution image and text inputs. Our results show that appropriate regularization, such as constraining the number of trainable parameters or adopting a low learning rate, effectively prevents forgetting when dealing with out-of-distribution images. However, we uncover a distinct form of forgetting in settings with in-distribution images and out-of-distribution text. We attribute this forgetting to task-specific overfitting and address it by introducing a data-hybrid training strategy that combines datasets and tasks. Finally, we demonstrate that this approach naturally extends to continual learning, outperforming existing methods that rely on complex auxiliary mechanisms. Overall, our findings challenge prevailing assumptions by highlighting the inherent robustness of MLLMs and providing practical guidelines for adapting them while preserving their general capabilities.
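As a rough illustration of the 2x2 framework, the sketch below groups per-benchmark accuracies by their (text, image) distribution labels and averages them within each quadrant. The benchmark names and accuracy values are hypothetical placeholders, not results from the paper, and the function name `quadrant_averages` is an assumption made for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-benchmark results: (name, text split, image split, accuracy).
# Names and numbers are placeholders for illustration, not results from the paper.
RESULTS = [
    ("bench_a", "ID", "ID", 0.90),
    ("bench_b", "ID", "OOD", 0.70),
    ("bench_c", "ID", "OOD", 0.80),
    ("bench_d", "OOD", "ID", 0.40),
    ("bench_e", "OOD", "OOD", 0.60),
]

VALID_SPLITS = ("ID", "OOD")


def quadrant_averages(results):
    """Return the mean accuracy for each (text split, image split) quadrant."""
    buckets = defaultdict(list)
    for name, text_split, image_split, accuracy in results:
        if text_split not in VALID_SPLITS or image_split not in VALID_SPLITS:
            raise ValueError(f"unknown split label for benchmark {name!r}")
        buckets[(text_split, image_split)].append(accuracy)
    # Average all benchmarks that fall into the same quadrant.
    return {quadrant: mean(accs) for quadrant, accs in buckets.items()}


averages = quadrant_averages(RESULTS)
for (text_split, image_split), accuracy in sorted(averages.items()):
    print(f"text={text_split:<3} image={image_split:<3} mean accuracy={accuracy:.3f}")
```

Tracking the four quadrant averages across checkpoints, rather than a single aggregate score, is what makes it possible to tell forgetting on out-of-distribution images apart from forgetting on out-of-distribution text.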

Paper Structure (56 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Evaluation matrix. A 2$\times$2 design crossing text and image distributions. For both text and images, we define in-distribution (ID) data as samples drawn from the same probability distribution as the training set. Conversely, out-of-distribution (OOD) data originates from a distribution not encountered during training. During evaluation, we report average accuracy within each quadrant. This setup allows us to systematically evaluate a comprehensive range of training and evaluation scenarios. Further details are provided in Appendix \ref{apd:dataset-matrix}.
  • Figure 2: Single-task fine-tuning across the evaluation matrix. Each curve traces checkpoints during fine-tuning: x-axis = ID accuracy on ImageNet validation (the fine-tuned task), y-axis = accuracy on an ID/OOD evaluation. Layout and colors follow Figure \ref{fig:single-task-formulation}. Legends show trainable part (method, learning rate). Performance is largely maintained in ID$^{T}$--OOD$^{I}$ and OOD$^{T}$--OOD$^{I}$ with the simplest regularization on parameter updates, with a notable drop only in OOD$^{T}$--ID$^{I}$. Full hyperparameters are in Appendix \ref{apd:train-figure2}.
  • Figure 3: ImageWikiQA with class-label distractors. Left: an example transformation where one distractor is replaced by the correct class name. Right: accuracy with/without a class-name distractor, before and after fine-tuning, using the configuration LLM Backbone (Full, 1e-6). The substantial decrease in accuracy and the concurrent increase in “mischoice on class name” after fine-tuning indicate that the model stops following prompt instructions and instead defaults to directly selecting the choice that contains the class label. Therefore, the primary issue is task-specific overfitting rather than catastrophic forgetting.
  • Figure 4: Ablations for data-hybrid training. (a) Mixing ImageNet-VQA with Flowers102, OCR-VQA, or LLaVA-665K (each at 50% of training instances). (b) Varying the LLaVA-665K mix from 0% to 70%; larger, darker markers denote higher ratios. Augmenting the training data with diverse textual inputs helps to alleviate task-specific overfitting. Consequently, this data-hybrid method improves model robustness in the OOD$^{T}$--ID$^{I}$ setting with minimal trade-offs for ID performance. Training details are in Appendix \ref{apd:train-figure3}.
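
The mixing ratios in the Figure 4 ablation can be pictured with a minimal sketch like the one below. It assumes the total number of training instances is held fixed at the size of the primary dataset and that a fraction `aux_ratio` of them is drawn from an auxiliary dataset; the paper's actual sampling procedure may differ. The function name `mix_datasets`, the seed handling, and the toy examples are all hypothetical.

```python
import random


def mix_datasets(primary, auxiliary, aux_ratio, seed=0):
    """Build a training list in which `aux_ratio` of instances come from `auxiliary`.

    Assumption for this sketch: the total size stays equal to len(primary).
    """
    if not 0.0 <= aux_ratio < 1.0:
        raise ValueError("aux_ratio must be in [0, 1)")
    rng = random.Random(seed)
    total = len(primary)
    n_aux = round(total * aux_ratio)
    if n_aux > len(auxiliary):
        raise ValueError("auxiliary dataset is too small for the requested ratio")
    # Sample without replacement from each source, then shuffle the combined list.
    mixed = rng.sample(primary, total - n_aux) + rng.sample(auxiliary, n_aux)
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for the fine-tuning task and an auxiliary instruction dataset.
primary = [("primary", i) for i in range(10)]
auxiliary = [("auxiliary", i) for i in range(20)]

mixed = mix_datasets(primary, auxiliary, aux_ratio=0.5)
n_from_aux = sum(1 for source, _ in mixed if source == "auxiliary")
print(f"{n_from_aux} of {len(mixed)} instances come from the auxiliary dataset")
```

Sweeping `aux_ratio` over a range of values corresponds to the kind of ratio ablation shown in panel (b), under the fixed-total-size assumption stated above.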