Improving Visual Prompt Tuning for Self-supervised Vision Transformers

Seungryong Yoo; Eunji Kim; Dahuin Jung; Jungbeom Lee; Sungroh Yoon

TL;DR

This paper tackles the limited effectiveness of Visual Prompt Tuning on self-supervised Vision Transformers by revealing that prompt impact depends on which ViT blocks are engaged. It introduces Gated Prompt Tuning, which learns block-wise gates to selectively route prompt influence, and Adaptive Attention Shaping, which tunes per-block attention to encode task-specific instructions. Across FGVC, VTAB-1K, and ADE20K, the proposed method consistently outperforms VPT variants for MAE and MoCo v3, often with fewer prompt tokens. The approach enhances transfer learning for SSL ViTs, offering a versatile, token-efficient strategy for both classification and dense prediction tasks.

Abstract

Visual Prompt Tuning (VPT) is an effective tuning method for adapting pretrained Vision Transformers (ViTs) to downstream tasks. It leverages extra learnable tokens, known as prompts, which steer the frozen pretrained ViTs. Although VPT has demonstrated its applicability with supervised vision transformers, it often underperforms with self-supervised ones. Through empirical observations, we deduce that the effectiveness of VPT hinges largely on the ViT blocks with which the prompt tokens interact. Specifically, VPT shows improved performance on image classification tasks for MAE and MoCo v3 when the prompt tokens are inserted into later blocks rather than the first block. These observations suggest that there exists an optimal location of blocks for the insertion of prompt tokens. Unfortunately, identifying the optimal blocks for prompts within each self-supervised ViT for diverse future scenarios is a costly process. To mitigate this problem, we propose a simple yet effective method that learns a gate for each ViT block to adjust its intervention into the prompt tokens. With our method, prompt tokens are selectively influenced by blocks that require steering for task adaptation. Our method outperforms VPT variants in FGVC and VTAB image classification and ADE20K semantic segmentation. The code is available at https://github.com/ryongithub/GatedPromptTuning.
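The per-block gate described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the Figure 3 caption states only that the gate forms a convex combination of a block's input and output prompt representations. The sigmoid parameterization, the choice to weight the block output by the gate, and the names `gated_prompt_update` and `gate_logit` are assumptions made for this example.

```python
import math


def sigmoid(x):
    """Map a real-valued logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_prompt_update(z_prev, z_block, gate_logit):
    """Blend the prompt tokens entering a block with the block's output.

    z_prev     : prompt tokens entering the block (list of token vectors)
    z_block    : prompt tokens produced by the block
    gate_logit : hypothetical learnable scalar for this block (assumption)
    """
    g = sigmoid(gate_logit)
    # Convex combination: g weights the block output, (1 - g) the input.
    # Which side the gate weights is an assumption of this sketch.
    return [
        [g * b + (1.0 - g) * p for p, b in zip(tok_prev, tok_block)]
        for tok_prev, tok_block in zip(z_prev, z_block)
    ]


z_prev = [[1.0, 2.0], [3.0, 4.0]]
z_block = [[0.0, 0.0], [1.0, 1.0]]

half = gated_prompt_update(z_prev, z_block, 0.0)       # gate = 0.5
closed = gated_prompt_update(z_prev, z_block, -20.0)   # gate near 0
opened = gated_prompt_update(z_prev, z_block, 20.0)    # gate near 1
```

Under these assumptions, a gate near 0 passes the prompt tokens through almost unchanged, so the block barely affects them, while a gate near 1 lets the block fully update them.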

Paper Structure (30 sections, 11 equations, 11 figures, 8 tables, 1 algorithm)

Introduction
Preliminaries
Motivation
Proposed Method
Gated Prompt Tuning
Adaptive Attention Shaping
Comparison with VPT
Experiments
Experimental Setup
Main Results
Classification on FGVC
Classification on VTAB-1K
Semantic segmentation on ADE20K
Additional Analysis
Ablation Studies
...and 15 more sections

Figures (11)

Figure 1: Classification accuracy on the CUB and KITTI datasets as the block at which prompt tokens are inserted into the pretrained ViT-B/16 varies. MAE and MoCo v3 improve significantly when prompt tokens are affected by the blocks after the 11th and 8th blocks, respectively. The block index denotes the initial insertion point of the prompt tokens.
Figure 2: Images reconstructed using Deep Image Prior (DIP) with a pretrained ViT block's representation as the training target. A reconstruction remains similar to the original image when information is preserved up to that block. Row 1: original image. Rows 2-4: reconstruction results for each pretrained ViT. Poor results in late blocks (7th and 10th) of the supervised model indicate that it discards more information across blocks than the self-supervised ViTs.
Figure 3: An illustration of our proposed method, Gated Prompt Tuning. $\textbf{Z}_P^{l-1}$ and $\tilde{\textbf{Z}}_P^l$ are the input and output prompt representations of the $l$-th block. The learnable gate $g^l$ forms a convex combination of $\textbf{Z}_P^{l-1}$ and $\tilde{\textbf{Z}}_P^l$, so that the $(l+1)$-th block receives the prompt representation $\textbf{Z}_P^l$, in which the intervention of the $l$-th block into the prompt representation is adjusted by $g^l$.
Figure 4: Selection ratio $\mathbf{r}$ on NABirds and Stanford Cars fine-grained classification and on ADE20K semantic segmentation. The selection ratio represents the influence of each block on the prompt representation of the last block.
Figure 5: Visualization of the self-attention maps of ViT-B/16 blocks. Both prompt tuning and temperature scaling adjust the self-attention map from MAE. GATE denotes Gated Prompt Tuning and LT denotes Adaptive Attention Shaping with learnable temperatures.
...and 6 more figures
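
The temperature scaling mentioned in the Figure 5 caption can be illustrated with a small sketch of scaled dot-product attention weights. This is a hedged example, not the paper's formulation: the text above says only that Adaptive Attention Shaping uses learnable temperatures to adjust the self-attention map. Dividing the attention logits by a single scalar `temperature`, and the function name `attention_weights`, are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, temperature=1.0):
    """Attention weights of one query over a set of keys.

    temperature : hypothetical per-block scalar (assumption); values
                  below 1 sharpen the map, values above 1 flatten it.
    """
    scale = temperature * math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

sharp = attention_weights(query, keys, temperature=0.5)
base = attention_weights(query, keys, temperature=1.0)
flat = attention_weights(query, keys, temperature=2.0)
```

In this sketch, lowering the temperature concentrates attention on the best-matching key and raising it spreads attention more evenly. In a tuning setup, such a scalar would be learned per block while the pretrained weights stay frozen.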
