Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning?

Cheng Han; Qifan Wang; Yiming Cui; Wenguan Wang; Lifu Huang; Siyuan Qi; Dongfang Liu

TL;DR

This paper systematically compares visual prompt tuning (VPT) against full finetuning (FT) across 19 VTAB-1k tasks using a Vision Transformer backbone. It frames transfer-learning scenarios along two axes, task disparity and data distribution similarity, and measures distribution shift with the Fréchet Inception Distance (FID). VPT is preferable in three of the four resulting quadrants, especially when downstream data is limited, while FT closes the gap as data grows. The study shows that overfitting is only part of the explanation: preserving pretrained features via prompts, rather than merely adding parameters, drives VPT’s advantage, and GradCAM and Integrated Gradients visualizations support improved feature learning under prompting. The results offer practical guidance on when to adopt VPT for parameter-efficient transfer learning in large-scale vision models and point to further avenues for understanding prompt-based mechanisms.

Abstract

As the scale of vision models continues to grow, Visual Prompt Tuning (VPT) has gained attention as a parameter-efficient transfer learning technique because of its superior performance compared to traditional full finetuning. However, the conditions favoring VPT (the "when") and the underlying rationale (the "why") remain unclear. In this paper, we conduct a comprehensive analysis across 19 distinct datasets and tasks. To understand the "when" aspect, we identify the scenarios where VPT proves favorable along two dimensions: task objectives and data distributions. We find that VPT is preferable when there is 1) a substantial disparity between the original and the downstream task objectives (e.g., transitioning from classification to counting), or 2) a similarity in data distributions between the two tasks (e.g., both involve natural images). In exploring the "why" dimension, our results indicate that VPT's success cannot be attributed solely to overfitting and optimization considerations. The unique way VPT preserves original features and adds parameters appears to be a pivotal factor. Our study provides insights into VPT's mechanisms and offers guidance for its optimal utilization.

Paper Structure (27 sections, 1 equation, 19 figures, 20 tables)

Introduction
Related Work
Methodology
When Should We Choose VPT?
Experiment Setup
Initial Experimental Results
Choose VPT for Three Quadrants of Transfer Learning
Gap Narrows as Downstream Datasets Expand
Why Does VPT Outperform FT?
Overfitting is Part of the Reason
Optimization does not Particularly Favor VPT
Further Observations
Visualizing the Effect of VPT on Feature Learning
Conclusion and Discussion
Per-task Training/testing Curve
...and 12 more sections

Figures (19)

Figure 1: VPT is identified to be preferable in 3 out of 4 transfer learning scenarios when downstream data is limited.
Figure 2: Full finetuning vs. visual prompt tuning. Visual prompt tuning only learns a small set of prompts.
Figure 3: Overall FID score with respect to win/loss of visual prompt tuning on the VTAB-1k benchmark, categorized into Natural, Specialized and Structured, respectively. Colors represent the method with higher accuracy. Under the same train-val split and low task disparity, a higher FID score may correspond to relatively higher accuracy for full finetuning, and vice versa. Solid fill indicates that the disparity between the target task and the pretrained task is small, while slash fill indicates a large disparity (e.g., distance, azimuth, counting). The FID scores are highly stable across repeated runs (i.e., std < 0.5%), so error bars are not shown.
Figure 4: Analysis of dataset capacity on VTAB-1k Natural (left), Specialized (middle) and Structured (right), respectively. For each group, we select four datasets and plot accuracy for FT (solid lines) and VPT (dotted lines). The number of training samples is shown on a log scale for better separation, and each color stands for an individual classification task. Each point is the average over five runs. In general, as the dataset grows, the performance gap between FT and VPT narrows. FT even surpasses VPT in 9 of 12 cases in this plot as the number of data samples increases (the same trend holds on other datasets). For per-task accuracy tables across different dataset scales and detailed FT and VPT accuracy plots, see the supplementary material.
Figure 5: Training/testing loss curves of six datasets from VTAB-1k. Colors represent four training strategies: full finetuning, prompt tuning, mixed, and FT-then-PT, respectively. We show two representative tasks each from VTAB-1k Natural, Specialized, and Structured. Full results and log-scale results are presented in the supplementary material.
...and 14 more figures
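
Figure 3 relates the FID between pretraining and downstream data to which method wins. As a rough illustration of what such a score measures, the sketch below computes the Fréchet distance between two sets of feature vectors, treating each set as a Gaussian with its sample mean and covariance. It assumes the features have already been extracted by some fixed network; the function name `frechet_distance` and the random inputs are hypothetical, and the paper's exact feature extractor and preprocessing are not reproduced here.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim).
    Hypothetical helper for illustration, not the paper's pipeline.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Toy usage with synthetic "features" standing in for network activations.
rng = np.random.default_rng(0)
source = rng.normal(size=(500, 4))
shifted = source + 3.0  # same covariance, mean moved by 3 in each of 4 dims

same_score = frechet_distance(source, source)
shifted_score = frechet_distance(source, shifted)
```

Identical feature sets give a distance of approximately zero, while a pure mean shift contributes the squared Euclidean distance between the means (here about 4 × 3² = 36), so larger scores indicate a larger distribution shift between the two datasets.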
