Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning?
Cheng Han, Qifan Wang, Yiming Cui, Wenguan Wang, Lifu Huang, Siyuan Qi, Dongfang Liu
TL;DR
This paper systematically compares visual prompt tuning (VPT) against full finetuning (FT) across 19 VTAB-1k tasks using a Vision Transformer backbone. It organizes transfer-learning scenarios along two axes, task disparity and data-distribution similarity, and measures distribution shift with Fréchet Inception Distance. VPT comes out ahead in three of the four resulting quadrants, especially under limited data, while FT closes the gap as data grows. The study shows that overfitting is only part of the story: preserving pretrained features via prompts, rather than merely adding parameters, drives VPT's advantage, and GradCAM and Integrated Gradients visualizations support improved feature learning under prompting. The results offer practical guidance on when to adopt VPT for parameter-efficient transfer learning in large-scale vision models and point to open questions about how prompt-based mechanisms work.
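To make the distribution-shift measurement more concrete, here is a minimal sketch of a Fréchet distance between two sets of feature vectors. It is not the paper's pipeline. Standard FID fits a Gaussian with a full covariance matrix to Inception features and computes ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)). The sketch below assumes diagonal covariances, which reduces the trace term to a sum of squared standard-deviation gaps. The function name `frechet_distance_diag` and the toy feature lists are hypothetical.

```python
from statistics import fmean, pstdev


def frechet_distance_diag(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    ASSUMPTION: covariances are treated as diagonal, so each feature
    dimension contributes (mean gap)^2 + (std gap)^2. Real FID uses the
    full covariance of Inception features instead.
    """
    dims = len(feats_a[0])
    total = 0.0
    for d in range(dims):
        # Collect dimension d across all samples of each set.
        col_a = [row[d] for row in feats_a]
        col_b = [row[d] for row in feats_b]
        mean_gap = fmean(col_a) - fmean(col_b)
        std_gap = pstdev(col_a) - pstdev(col_b)
        total += mean_gap ** 2 + std_gap ** 2
    return total


# Toy stand-ins for pretraining-data and downstream-data features.
source_feats = [[0.0, 1.0], [2.0, 3.0]]
target_feats = [[3.0, 5.0], [5.0, 7.0]]
distance = frechet_distance_diag(source_feats, target_feats)
```

A smaller distance suggests the downstream data is distributed more like the pretraining data. The summary above does not state any threshold separating "similar" from "dissimilar" distributions, so none is assumed here.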
Abstract
As the scale of vision models continues to grow, Visual Prompt Tuning (VPT) has gained attention as a parameter-efficient transfer learning technique due to its superior performance compared to traditional full finetuning. However, the conditions favoring VPT (the "when") and the underlying rationale (the "why") remain unclear. In this paper, we conduct a comprehensive analysis across 19 distinct datasets and tasks. To understand the "when" aspect, we identify the scenarios where VPT proves favorable along two dimensions: task objectives and data distributions. We find that VPT is preferable when there is 1) a substantial disparity between the original and the downstream task objectives (e.g., transitioning from classification to counting), or 2) a similarity in data distributions between the two tasks (e.g., both involve natural images). In exploring the "why" dimension, our results indicate that VPT's success cannot be attributed solely to overfitting and optimization considerations. The unique way VPT preserves original features and adds parameters appears to be a pivotal factor. Our study provides insights into VPT's mechanisms and offers guidance for its optimal utilization.
