Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning

Wenke Huang, Jian Liang, Zekun Shi, Didi Zhu, Guancheng Wan, He Li, Bo Du, Dacheng Tao, Mang Ye

TL;DR

This work proposes measuring the parameter importance for both pre-trained and fine-tuning distributions, based on frozen pre-trained weight magnitude and accumulated fine-tuning gradient values, and applies an importance-aware weight allocation strategy, selectively updating relatively important parameters for downstream tasks.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong generalization capabilities across diverse distributions and tasks, largely due to extensive pre-training datasets. Fine-tuning MLLMs has become a common practice to improve performance on specific downstream tasks. However, during fine-tuning, MLLMs often risk forgetting knowledge acquired during pre-training, which can degrade their generalization abilities. To balance the trade-off between generalization and specialization, we propose measuring the parameter importance for both pre-trained and fine-tuning distributions, based on frozen pre-trained weight magnitude and accumulated fine-tuning gradient values. We further apply an importance-aware weight allocation strategy, selectively updating relatively important parameters for downstream tasks. We conduct empirical evaluations on both image captioning and visual question-answering tasks using various MLLM architectures. The comprehensive experimental analysis demonstrates the effectiveness of the proposed solution, highlighting the efficiency of the crucial modules in enhancing downstream specialization performance while mitigating generalization degradation in MLLM fine-tuning.
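As a rough illustration of the idea described in the abstract, the following NumPy sketch builds an update mask from the two importance signals and applies a gradient step only to the selected parameters. The selection rule used here (rank-normalize both scores and keep parameters whose downstream rank exceeds their upstream rank) is an assumption made for illustration, not the rule stated in the paper; the function names `rank_normalize`, `importance_mask`, and `masked_update` are likewise hypothetical.

```python
import numpy as np

def rank_normalize(x):
    """Map each entry to (0, 1] by its rank, so the two scores are comparable."""
    ranks = np.argsort(np.argsort(x.ravel()))
    return ((ranks + 1) / x.size).reshape(x.shape)

def importance_mask(w_pre, grad_accum):
    """Select parameters that look more important downstream than upstream.

    Upstream importance:   magnitude of the frozen pre-trained weights.
    Downstream importance: magnitude of the accumulated fine-tuning gradients.
    The comparison rule below is an illustrative assumption.
    """
    upstream = rank_normalize(np.abs(w_pre))
    downstream = rank_normalize(np.abs(grad_accum))
    return downstream > upstream

def masked_update(w, grad, mask, lr=0.1):
    """Plain gradient step applied only where the mask is True."""
    return w - lr * grad * mask

# Toy example with four parameters.
w_pre = np.array([1.0, -0.1, 0.5, 2.0])
grad_accum = np.array([0.1, 0.9, 0.5, 0.2])
mask = importance_mask(w_pre, grad_accum)
w_new = masked_update(w_pre, np.ones_like(w_pre), mask)
```

In this toy case, the two parameters with small pre-trained magnitude but large accumulated gradient are updated, while the two large-magnitude weights stay at their pre-trained values.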

Paper Structure

This paper contains 15 sections, 10 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Background and Motivation. Fine-tuning a Multimodal Large Language Model (MLLM) on downstream tasks typically involves training the connector and LLM modules while freezing the visual encoder. We reveal a higher parameter importance difference (PID) on unseen downstream distributions, e.g., Flickr30k, than on seen upstream distributions, e.g., OKVQA, where $\mathrm{PID} = \cos(|w^*|, |g|)^{-2}$. We use the absolute values of the pre-trained weights $|w^*|$ and the fine-tuning gradients $|g|$ to represent the upstream and downstream parameter importance, respectively.
  • Figure 2: Conceptual Comparison. (a) SPIDER iteratively measures the parameter importance discrepancy to construct the update mask, which protects generalization and squeezes specialization information into the selected elements. (b) DARE combines the learned elements with the pre-trained ones and further rescales the candidate ones. The legend distinguishes frozen pre-trained elements, current learning parameters, and completed learned ones.
  • Figure 3: Ablation Comparison on Response Output on Flickr30k. The text prompt is "Write a short description for the image." Full FT follows the instructions better than Zero-shot but introduces hallucination (e.g., “at a dog competition”), while Zero-shot lacks task details. Please refer to \ref{sec:ablation}.
  • Figure 4: Visualization Comparison. Radar charts plot the results of fine-tuning methods across four pre-trained source datasets and the target datasets, i.e., Flickr30k and COCO-Caption. Our method achieves a better generalization and specialization trade-off.
  • Figure 5: Comparison on Large Fine-Tuning Epochs $E$ (from 5 to 10 rounds) on Flickr30k. Refer to \ref{sec:compSOTA} for details.
  • ...and 2 more figures
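
The PID quantity defined in the Figure 1 caption, $\cos(|w^*|, |g|)^{-2}$, can be computed directly from a weight tensor and a gradient tensor. The sketch below is a minimal NumPy version under two assumptions that the caption does not state: tensors are flattened before taking the cosine, and a small `eps` guards against division by zero. The function name `pid` is hypothetical.

```python
import numpy as np

def pid(w_pre, grad, eps=1e-12):
    """Parameter importance difference: cos(|w*|, |g|)^(-2).

    A value of 1 means the two importance profiles are perfectly aligned;
    larger values mean upstream and downstream importance disagree more.
    """
    a = np.abs(w_pre).ravel()   # upstream importance |w*|
    b = np.abs(grad).ravel()    # downstream importance |g|
    cos = (a @ b) / max(np.linalg.norm(a) * np.linalg.norm(b), eps)
    return 1.0 / max(cos ** 2, eps)

# Aligned importance profiles give the minimum PID of 1.
aligned = pid(np.array([1.0, 0.0]), np.array([-3.0, 0.0]))
# A 45-degree angle between |w*| and |g| gives cos^2 = 1/2, so PID = 2.
misaligned = pid(np.array([1.0, 1.0]), np.array([1.0, 0.0]))
```

Because both vectors are non-negative after taking absolute values, the cosine lies in [0, 1], so PID is at least 1 and grows as the two importance profiles diverge.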