An Empirical Study of Parameter Efficient Fine-tuning on Vision-Language Pre-train Model

Yuxin Tian, Mouxing Yang, Yunfan Li, Dayiheng Liu, Xingzhang Ren, Xi Peng, Jiancheng Lv

TL;DR

The paper investigates how data size and tunable parameter budget influence five exogenous PEFT methods applied to a Vision-Language Pre-training (VLP) model across two vision-language (VL) tasks. By comparing embedding-composition and layer-composition techniques, it shows that data size matters when the downstream data and task diverge from pre-training, whereas data size has little impact and the effect of parameter size is non-monotonic when the downstream task is consistent with pre-training. The study finds that layer-composition approaches (e.g., LoRA) often yield better efficiency and performance, and that fine-tuning the final classifier alongside a PEFT method can surpass full fine-tuning under certain conditions. These insights offer practical guidance for choosing training strategies in PEFT-equipped VLP systems and highlight how task-data alignment shapes adaptation dynamics.

Abstract

Recent studies have applied Parameter Efficient Fine-Tuning techniques (PEFTs) to efficiently narrow the performance gap between pre-training and downstream tasks. Two factors are important for PEFTs: the accessible data size and the fine-tunable parameter size. A natural expectation is that the performance of PEFTs is positively related to both. However, according to the evaluation of five PEFTs on two downstream vision-language (VL) tasks, we find that this intuition holds only if the downstream data and task are not consistent with pre-training. For downstream fine-tuning consistent with pre-training, data size no longer affects the performance, while the influence of fine-tunable parameter size is not monotonic. We believe this observation could guide the choice of training strategy for various PEFTs.
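
To make "fine-tunable parameter size" concrete, the sketch below counts trainable parameters for a low-rank update of one dense projection (as in LoRA), for a set of learned prompt embeddings (as in prompt-tuning), and for an additionally fine-tuned linear classifier. It is only an illustration: the function names, the hidden size of 768, the rank, the prompt length, and the number of classes are assumptions chosen for the example, not settings reported in the paper.

```python
# Illustrative parameter counting; all names and sizes are hypothetical,
# chosen for the example rather than taken from the paper.

def full_finetune_params(d_in: int, d_out: int) -> int:
    """Weights trained when a dense d_in x d_out projection is updated directly (bias ignored)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r update trains two factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def prompt_tuning_params(num_prompts: int, d_model: int) -> int:
    """Prompt-tuning trains num_prompts extra embedding vectors of width d_model."""
    return num_prompts * d_model


def classifier_params(d_model: int, num_classes: int) -> int:
    """A linear classification head: weight matrix plus bias vector."""
    return d_model * num_classes + num_classes


def budget(d_model: int, rank: int, num_prompts: int, num_classes: int) -> dict:
    """Collect the counts for one projection so the budgets can be compared side by side."""
    head = classifier_params(d_model, num_classes)
    lora = lora_params(d_model, d_model, rank)
    prompt = prompt_tuning_params(num_prompts, d_model)
    return {
        "full_projection": full_finetune_params(d_model, d_model),
        "lora": lora,
        "prompt_tuning": prompt,
        # Optionally unfreezing the final classifier enlarges either budget.
        "lora_plus_classifier": lora + head,
        "prompt_tuning_plus_classifier": prompt + head,
    }


# Assumed sizes: hidden width 768, rank 8, 20 prompt vectors, 10 classes.
example = budget(d_model=768, rank=8, num_prompts=20, num_classes=10)
for name, count in example.items():
    print(f"{name}: {count}")
```

Varying `rank` or `num_prompts` in such a sketch changes the parameter budget without touching the frozen backbone, which is the quantity the study varies alongside the amount of training data.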

Paper Structure

This paper contains 23 sections, 15 equations, 11 figures, and 1 table.

Figures (11)

  • Figure 1: The performance is only affected by the size of fine-tunable parameters when the downstream task and data are consistent with pre-training. Otherwise, the performance is positively related to data and parameter size.
  • Figure 2: Unified view of evaluated PEFT methods within a transformer block.
  • Figure 3: Layer composition PEFTs achieve better performance than embedding composition PEFTs on MSCOCO Caption. Both layer and embedding composition PEFTs can achieve performance comparable to full fine-tuning. The performance of the tested PEFTs is independent of the accessible data size. Increasing the size of fine-tunable parameters by simultaneously fine-tuning the final classifier improves only the performance of prompt-tuning and hurts that of the layer composition PEFTs.
  • Figure 4: Layer composition PEFTs achieve better performance than embedding composition PEFTs on VQAv2. Empirically, layer composition PEFTs can achieve performance comparable to full fine-tuning, while embedding composition PEFTs cannot. The performance of the tested PEFTs is positively correlated with the accessible training data and the fine-tunable parameters. Additionally, simultaneously fine-tuning the final classifier of the model can further boost the performance and even outperform full fine-tuning.
  • Figure 5: The comparison of various PEFTs.
  • ...and 6 more figures