Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning?

Cheng Han, Qifan Wang, Yiming Cui, Wenguan Wang, Lifu Huang, Siyuan Qi, Dongfang Liu

TL;DR

Facing the Elephant in the Room systematically compares visual prompt tuning (VPT) against full finetuning (FT) across 19 VTAB-1k tasks using a Vision Transformer backbone. By framing transfer-learning scenarios along task disparity and data distributions and measuring distribution shifts with Fréchet Inception Distance, the study finds VPT dominates in three of four quadrants, especially under limited data, while FT closes the gap as data grows. It demonstrates that overfitting is only part of the story and that preserving pretrained features via prompts—rather than merely adding parameters—drives VPT’s advantage, with GradCAM and Integrated Gradients visualizations supporting improved feature learning under prompting. The results provide practical guidance on when to adopt VPT for parameter-efficient transfer learning in large-scale vision models and point to rich avenues for understanding prompt-based mechanisms.

Abstract

As the scale of vision models continues to grow, Visual Prompt Tuning (VPT) has gained attention as a parameter-efficient transfer learning technique due to its superior performance compared to traditional full finetuning. However, the conditions favoring VPT (the "when") and the underlying rationale (the "why") remain unclear. In this paper, we conduct a comprehensive analysis across 19 distinct datasets and tasks. To understand the "when" aspect, we identify the scenarios where VPT proves favorable along two dimensions: task objectives and data distributions. We find that VPT is preferable when there is 1) a substantial disparity between the original and the downstream task objectives (e.g., transitioning from classification to counting), or 2) a similarity in data distributions between the two tasks (e.g., both involve natural images). In exploring the "why" dimension, our results indicate that VPT's success cannot be attributed solely to overfitting and optimization considerations. The unique way VPT preserves original features and adds parameters appears to be a pivotal factor. Our study provides insights into VPT's mechanisms and offers guidance for its optimal utilization.
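The two "when" conditions above can be read as a four-quadrant rule for the limited-data setting. The sketch below is only an illustration of that reading, not the authors' procedure: the function name `recommend_tuning`, the boolean disparity flag, and the `fid_threshold` cut-off are all hypothetical, and the paper does not prescribe a specific threshold value.

```python
def recommend_tuning(task_disparity_is_large: bool,
                     fid: float,
                     fid_threshold: float) -> str:
    """Return "VPT" or "FT" for a limited-data transfer scenario.

    Assumptions (hypothetical, for illustration only):
      * task disparity is supplied as a yes/no judgment by the caller;
      * data-distribution similarity is approximated by comparing an FID
        score against a caller-chosen threshold (lower FID = more similar).
    """
    data_is_similar = fid <= fid_threshold

    # VPT is favored if EITHER condition from the abstract holds:
    # 1) large task disparity, or 2) similar data distributions.
    if task_disparity_is_large or data_is_similar:
        return "VPT"

    # Remaining quadrant: small task disparity and dissimilar data.
    return "FT"


if __name__ == "__main__":
    # Enumerate the four quadrants with an arbitrary example threshold.
    for disparity in (True, False):
        for fid in (50.0, 250.0):
            print(disparity, fid, recommend_tuning(disparity, fid, 150.0))
```

Only one of the four quadrants returns full finetuning, which matches the summary that VPT is preferable in three of four scenarios when downstream data is limited. The rule deliberately ignores dataset size, although the study reports that the gap narrows as data grows.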

Paper Structure (27 sections, 1 equation, 19 figures, 20 tables)


Figures (19)

  • Figure 1: VPT is identified to be preferable in 3 out of 4 transfer learning scenarios when downstream data is limited.
  • Figure 2: Full finetuning vs. visual prompt tuning. Visual prompt tuning only learns a small set of prompts.
  • Figure 3: Overall FID score with respect to the win/loss of visual prompt tuning on the VTAB-1k benchmark, categorized into Natural, Specialized, and Structured. Colors represent the method with higher accuracy. Under the same train-val split and low task disparity, a higher FID score may lead to relatively higher accuracy for full finetuning, and vice versa. Solid fill indicates that the disparity between the target task and the pretrained task is small, while slash fill indicates a large disparity (e.g., distance, azimuth, counting). The FID scores are highly robust across repeated runs (i.e., $std < 0.5\%$), so error bars are omitted.
  • Figure 4: Analysis of dataset capacity on VTAB-1k Natural (left), Specialized (middle), and Structured (right). For each group, we select four datasets and plot accuracy for FT (solid lines) and VPT (dotted lines). The number of training samples is shown on a log scale for better separation, and each color stands for an individual classification task. Each point is the average over five runs. In general, as the dataset grows, the performance gap between FT and VPT narrows. FT even surpasses VPT in 9 of 12 cases in this plot as the number of samples increases (the same tendency appears in other datasets). For per-task accuracy tables across dataset scales and detailed FT and VPT accuracy plots, see the supplementary material.
  • Figure 5: Training/testing loss curves of six datasets from VTAB-1k. Colors represent four training strategies: full finetuning, prompt tuning, mixed, and FT-then-PT. We show two representative tasks each from VTAB-1k Natural, Specialized, and Structured. Full results and log-scale results are presented in the supplementary material.
  • ...and 14 more figures
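
The FID scores in Figure 3 measure the distance between two feature distributions. A minimal sketch of the standard Fréchet distance between Gaussians fitted to two feature sets is given below. It assumes the features have already been extracted (FID conventionally uses Inception activations, a step omitted here), and the function name `frechet_distance` and the random stand-in features are hypothetical rather than taken from the paper.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim).
    Computes ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2}).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((C_a C_b)^{1/2}) equals the sum of square roots of the eigenvalues
    # of C_a @ C_b, which are real and non-negative for PSD covariances.
    # Small negative or imaginary parts are numerical noise and are dropped.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=(200, 4))   # stand-in for pretraining features
    target = source + 2.0                # same shape, shifted mean
    print(frechet_distance(source, source))
    print(frechet_distance(source, target))
```

In this toy usage the second set is a pure mean shift of the first, so the covariance terms cancel and the distance reduces to the squared norm of the shift. Real image features would differ in both mean and covariance.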