Revisiting the Power of Prompt for Visual Tuning

Yuzhu Wang, Lechao Cheng, Chaowei Fang, Dingwen Zhang, Manni Duan, Meng Wang

TL;DR

Problem: how to efficiently adapt pre-trained Vision Transformers to downstream tasks, particularly with self-supervised pretraining where prompt tuning can underperform. Approach: Self-Prompt Tuning (SPT) initializes prompts with downstream token prototypes derived from patch embeddings and uses a lightweight token-construction pipeline to boost convergence while keeping the backbone frozen ($0.4\%$ learnable parameters or less). Findings: under MAE pretraining, SPT delivers $10\%$ to $30\%$ relative gains over VPT and can outperform full fine-tuning in many tasks; MoCo-v3 and supervised pretraining also benefit, with SPT robust to prompt length and scalable to larger models. Significance: provides a practical, data-efficient alternative to full fine-tuning for large CV models in SSL contexts, with extensive ablations supporting its generality and scalability.

Abstract

Visual prompt tuning (VPT) is a promising solution incorporating learnable prompt tokens to customize pre-trained models for downstream tasks. However, VPT and its variants often encounter challenges like prompt initialization, prompt length, and subpar performance in self-supervised pretraining, hindering successful contextual adaptation. This study commences by exploring the correlation evolvement between prompts and patch tokens during proficient training. Inspired by the observation that the prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes. The strategic initialization, a stand-in for the previous initialization, substantially improves performance in fine-tuning. To refine further, we optimize token construction with a streamlined pipeline that maintains excellent performance with almost no increase in computational expenses compared to VPT. Exhaustive experiments show our proposed approach outperforms existing methods by a remarkable margin. For instance, it surpasses full fine-tuning in 19 out of 24 tasks, using less than 0.4% of learnable parameters on the FGVC and VTAB-1K benchmarks. Notably, our method significantly advances the adaptation for self-supervised pretraining, achieving impressive task performance gains of at least 10% to 30%. Besides, the experimental results demonstrate the proposed SPT is robust to prompt lengths and scales well with model capacity and training data size. We finally provide an insightful exploration into the amount of target data facilitating the adaptation of pre-trained models to downstream tasks. The code is available at https://github.com/WangYZ1608/Self-Prompt-Tuning.
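The initialization idea in the abstract can be illustrated with a minimal sketch: run downstream images through the frozen backbone, pool the resulting patch embeddings, and draw the initial prompts from that pool. The function name `init_prompts_from_patches`, the nested-list data layout, and the uniform sampling without replacement are assumptions made for illustration; they are not taken from the authors' repository, and the toy "embeddings" below stand in for real backbone outputs.

```python
import random


def init_prompts_from_patches(patch_embeddings, num_prompts, seed=0):
    """Pick initial prompt tokens from downstream patch embeddings.

    patch_embeddings: list of images, each a list of patch tokens,
    each token a list of floats (hypothetical layout for this sketch).
    """
    # Flatten the (image, patch) structure into one pool of candidate tokens.
    pool = [token for image in patch_embeddings for token in image]
    if num_prompts > len(pool):
        raise ValueError("need at least num_prompts patch tokens")
    rng = random.Random(seed)
    # Sample without replacement and copy each token, so that later
    # updates to the prompts do not modify the source embeddings.
    return [list(token) for token in rng.sample(pool, num_prompts)]


# Toy stand-in for backbone outputs: 2 images, 3 patches each, dimension 4.
batch = [[[float(i * 10 + j)] * 4 for j in range(3)] for i in range(2)]
prompts = init_prompts_from_patches(batch, num_prompts=2)
```

In a real setup the sampled tokens would become the initial values of the learnable prompt parameters, while the backbone weights stay frozen.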

Paper Structure (19 sections, 13 equations, 8 figures, 7 tables)

This paper contains 19 sections, 13 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Self-Prompt Tuning. Left: We input a batch of training data from the downstream task into the pre-trained model to obtain the forward patch embeddings. Right: We initialize prompts with sampled patch embeddings. Similar to VPT, we propose SPT-Shallow and SPT-Deep, depending on the layers involved. Only the prompts and task head parameters are learnable during adaptation to downstream tasks, while the transformer encoder is frozen.
  • Figure 2: With VPT, the Normalized Mutual Information (NMI; Estévez et al., 2009) between prompts and patch tokens gradually increases during fine-tuning. SPT has a large NMI from the start, which facilitates rapid convergence and leads to better results.
  • Figure 3: Ablation studies on several basic components using a Masked Autoencoder (MAE) pre-trained backbone and SPT-Deep, evaluated on CUB-200-2011. (a) The k-means clustering strategy improves further as more data is used to construct prompts, but it adds significant time cost during prompt construction. The wall-clock time is displayed in ($\cdot$). (b) Prompts should be constructed using data from downstream tasks. (c) SPT is robust to changes in prompt length and achieves slight gains as prompt length increases. (d) SPT scales better than VPT as model size increases. Observations with a supervised pre-trained backbone are similar (see the Appendix).
  • Figure 4: The impact of varying tuning data sizes with MAE and supervised pretraining.
  • Figure S1: The IN-21K supervised ViT-B counterpart of Fig. 3 on several major components.
  • ...and 3 more figures
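
The Figure 3(a) caption mentions a k-means clustering strategy for building prompts from downstream tokens. As a rough illustration of what clustering patch tokens into prototypes could look like, here is a small pure-Python sketch of Lloyd's algorithm. The helper names (`squared_distance`, `mean`, `kmeans_prototypes`), the iteration count, and the random-token initialization are assumptions for this example and may differ from the paper's actual procedure.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans_prototypes(tokens, k, iters=10, seed=0):
    """Cluster tokens with Lloyd's algorithm and return the k centers."""
    rng = random.Random(seed)
    # Start from k distinct tokens chosen at random.
    centers = [list(t) for t in rng.sample(tokens, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for token in tokens:
            # Assign each token to its nearest center.
            nearest = min(range(k), key=lambda c: squared_distance(token, centers[c]))
            clusters[nearest].append(token)
        # Recompute centers; keep the old center if a cluster is empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


# Toy tokens forming two well-separated groups in 2-D.
tokens = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
prototypes = sorted(kmeans_prototypes(tokens, k=2))
```

Each assignment pass compares every token against every center, which is consistent with the caption's remark that clustering adds noticeable time when many tokens are used.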