An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models

Xiongtao Zhou; Jie He; Yuhua Ke; Guangyao Zhu; Víctor Gutiérrez-Basulto; Jeff Z. Pan

TL;DR

The paper tackles parameter-efficient fine-tuning for multimodal large language models, addressing the impracticality of full fine-tuning by evaluating four PEFT methods (Adapter, LoRA, IA3, Prefix-Tuning) across three open-source MLLMs. Using a standardized suite of seven multimodal benchmarks and systematic ablations on connector tuning, module location, data scale, stability, generalization, and hallucination, it finds that Adapter-based approaches generally deliver the best overall performance, with LoRA close behind and IA3 offering strong results in certain scenarios. The study demonstrates that tuning connector layers is not universally beneficial and that PEFT effectiveness depends on dataset type (unseen vs. seen) and data scale, with unseen tasks benefiting more from richer fine-tuning data. Practically, this work provides guidance for deploying PEFT in MLLMs to balance performance, stability, and hallucination mitigation, and it contributes a reproducible evaluation framework for ongoing research.

Abstract

Multimodal large language models (MLLMs) fine-tuned with multimodal instruction datasets have demonstrated remarkable capabilities in multimodal tasks. However, fine-tuning all parameters of MLLMs has become challenging as they usually contain billions of parameters. To address this issue, we study parameter-efficient fine-tuning (PEFT) methods for MLLMs. We aim to identify effective methods for enhancing the performance of MLLMs in scenarios where only a limited number of parameters are trained. This paper conducts empirical studies using four popular PEFT methods to fine-tune the LLM component of open-source MLLMs. We present a comprehensive analysis that encompasses various aspects, including the impact of PEFT methods on various models, parameters and location of the PEFT module, size of fine-tuning data, model stability based on PEFT methods, MLLM's generalization, and hallucination. We evaluated four PEFT methods on seven datasets from two different categories: unseen and seen datasets. Across all experiments, we show that the adapter is the best-performing PEFT method. At the same time, fine-tuning the connector layers leads to improved performance in most MLLMs. Code and data are available at https://github.com/alenai97/PEFT-MLLM.git.
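
To make "only a limited number of parameters are trained" concrete, the sketch below counts trainable parameters for one linear layer under full fine-tuning, LoRA, and a bottleneck adapter. The layer width (4096), LoRA rank (8), and bottleneck size (64) are illustrative assumptions and are not taken from the paper; the function names are hypothetical, and the formulas follow the standard definitions of these methods rather than the authors' code.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Weights updated when a d_out x d_in linear layer is fully fine-tuned (bias ignored)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA trains two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def adapter_params(d_model: int, s: int) -> int:
    """Bottleneck adapter: down-projection (d_model x s) and up-projection (s x d_model), plus both biases."""
    return 2 * d_model * s + s + d_model


if __name__ == "__main__":
    d = 4096  # assumed hidden size, for illustration only
    full = full_params(d, d)
    lora = lora_params(d, d, r=8)
    adapter = adapter_params(d, s=64)
    print(f"full fine-tuning: {full} trainable weights")
    print(f"LoRA (r=8):       {lora} ({lora / full:.4%} of full)")
    print(f"Adapter (s=64):   {adapter} ({adapter / full:.4%} of full)")
```

Under these assumed sizes, both PEFT variants train a small fraction of the weights that full fine-tuning would update for the same layer, which is the regime the study targets.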

Paper Structure (30 sections, 6 equations, 23 figures, 15 tables)

Introduction
Related Work
PEFT Methods
Experiment Setup
Datasets
Implementations
Experimental Results
Main Results
Module Location
Data Scale
Stability Analysis
Overfitting and Generalization
Hallucination
Conclusion
Datasets Setup
...and 15 more sections

Figures (23)

Figure 1: (Left) Architecture of a Multimodal Large Language Model. Starting from 7 questions, we comprehensively explore the impact of PEFT methods and the connector on MLLMs, all of which are illustrated on the left. (Right) A detailed illustration of the PEFT module structure for the four PEFT methods.
Figure 2: The comparative performance of four PEFT methods on seen and unseen datasets, with and without the use of a connector.
Figure 3: Train and eval loss of all PEFT methods on SQA (img). The orange line shows train loss; the green line shows eval loss.
Figure 4: Average performance fluctuation over four epochs on each source-target domain pair. For each pair, we take the mean over the four PEFT methods and display the resulting average performance fluctuation.
Figure 5: Average accuracy for various PEFT parameter settings on SQA (all). s: bottleneck size; r: LoRA rank; vt: virtual tokens.
...and 18 more figures
