An Empirical Study of Parameter Efficient Fine-tuning on Vision-Language Pre-train Model
Yuxin Tian, Mouxing Yang, Yunfan Li, Dayiheng Liu, Xingzhang Ren, Xi Peng, Jiancheng Lv
TL;DR
The paper investigates how data size and tunable parameter budget influence five exogenous PEFT methods applied to a Vision-Language Pre-training (VLP) model across two vision-language (VL) tasks. By comparing embedding- and layer-composition techniques, it shows that data size matters when the downstream data and task diverge from pre-training, whereas data size has little effect and the effect of parameter size is non-monotonic when the downstream task is consistent with pre-training. The study reports that layer-based approaches (e.g., LoRA) often yield superior efficiency and performance, and that combining PEFT with full fine-tuning of only the final classifier can surpass full fine-tuning under certain conditions. These insights offer practical guidance for choosing training strategies in PEFT-equipped VLP systems and highlight how task-data alignment shapes adaptation dynamics.
Abstract
Recent studies have applied Parameter Efficient Fine-Tuning techniques (PEFTs) to efficiently narrow the performance gap between pre-training and downstream tasks. Two factors are important for any PEFT: the size of the accessible data and the number of fine-tunable parameters. A natural expectation is that the performance of a PEFT is positively related to both. However, after evaluating five PEFTs on two downstream vision-language (VL) tasks, we find that this intuition holds only if the downstream data and task are not consistent with pre-training. For downstream fine-tuning that is consistent with pre-training, data size no longer affects performance, while the influence of fine-tunable parameter size is not monotonic. We believe this observation could guide the choice of training strategy for various PEFTs.
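To make "fine-tunable parameter size" concrete, the sketch below counts trainable parameters for one linear layer under full fine-tuning and under a LoRA-style low-rank update, where the frozen weight W (d_out x d_in) is augmented with B A, with B of shape d_out x r and A of shape r x d_in. The layer width of 768 and the ranks are illustrative assumptions, not settings taken from the paper, and the helper names are hypothetical.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated (bias ignored)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a rank-`rank` update B @ A.

    A has shape (rank, d_in) and B has shape (d_out, rank); the original
    weight matrix stays frozen and contributes no trainable parameters.
    """
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in = d_out = 768  # assumed hidden size, for illustration only
    full = full_finetune_params(d_in, d_out)
    for rank in (1, 4, 8, 16, 64):  # assumed ranks, for illustration only
        tuned = lora_params(d_in, d_out, rank)
        print(f"rank={rank:>2}  trainable={tuned:>6}  share of full={tuned / full:.4f}")
```

Under these assumptions the rank is the knob that sets the parameter budget: it scales the trainable count linearly, which is the kind of axis along which the abstract reports a non-monotonic effect on performance.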
