Learning Visual Prompts for Guiding the Attention of Vision Transformers

Razieh Rezaei, Masoud Jalili Sabet, Jindong Gu, Daniel Rueckert, Philip Torr, Ashkan Khakzar

TL;DR

This paper addresses the problem of guiding vision transformer attention without fine-tuning by learning a visual patch prompt through self-supervision. The method inserts a learnable patch $P$, produced by a neural prior, into input images and minimizes a KL-divergence loss $L_{KL}$ so that last-layer attention matches a Gaussian target map $G(x,y)$ defined in token space. The approach applies across encoders (CLIP, SigLIP, DeiT, and DINOv2), improves CUB keypoint naming, gives competitive results on RefCOCO, and shows that prompt shape and scale affect attention steering. The framework offers a practical path for adapting future vision-language models to spatial prompts without relying on dataset-bias priors or supervised fine-tuning.
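The objective summarized above can be illustrated with a minimal NumPy sketch. It builds a Gaussian target map over a grid of patch tokens and scores a CLS-to-patch attention map against it with a KL divergence. The function names, the `sigma` value, the 7x7 grid, and the direction of the divergence (target relative to attention) are assumptions made for illustration; the summary does not specify them, and the actual method optimizes the prompt through a frozen encoder rather than evaluating fixed arrays.

```python
import numpy as np

def gaussian_target(grid_h, grid_w, center_row, center_col, sigma=1.0):
    """Normalized 2D Gaussian over a grid of patch tokens (sums to 1).

    sigma is a hypothetical width parameter, measured in tokens.
    """
    rows, cols = np.mgrid[0:grid_h, 0:grid_w]
    sq_dist = (rows - center_row) ** 2 + (cols - center_col) ** 2
    g = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return g / g.sum()

def kl_divergence(target, attention, eps=1e-12):
    """KL(target || attention) for two maps on the same token grid.

    The direction of the divergence is an assumption of this sketch.
    """
    t = target.ravel()
    a = attention.ravel()
    a = a / a.sum()  # renormalize the attention over patch tokens
    return float(np.sum(t * (np.log(t + eps) - np.log(a + eps))))

# Toy usage: a 7x7 token grid with the prompt placed at token (3, 4).
target = gaussian_target(7, 7, center_row=3, center_col=4)
uniform_attention = np.full((7, 7), 1.0 / 49)
loss_unfocused = kl_divergence(target, uniform_attention)
loss_matched = kl_divergence(target, target)
```

In this toy setting the loss is larger for the uniform attention map than for a map that already matches the target, which is the signal a prompt optimizer would follow.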

Abstract

Visual prompting infuses visual information into the input image to adapt models toward specific predictions and tasks. Recently, manually crafted markers such as red circles have been shown to guide a model to attend to a target region of the image. However, these markers only work on models trained with data containing those markers. Moreover, finding these prompts requires guesswork or prior knowledge of the domain on which the model is trained. This work circumvents manual design constraints by proposing to learn the visual prompts for guiding the attention of vision transformers. The learned visual prompt, when added to any input image, redirects the attention of the pre-trained vision transformer to the prompt's spatial location on the image. Specifically, the prompt is learned in a self-supervised manner, without requiring annotations and without fine-tuning the vision transformer. Our experiments demonstrate the effectiveness of the proposed optimization-based visual prompting strategy across various pre-trained vision encoders.

Paper Structure (7 sections, 1 equation, 6 figures, 3 tables, 1 algorithm)

This paper contains 7 sections, 1 equation, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Learned Prompt for CLIP models (Radford et al., 2021), SigLIP (Zhai et al., 2023), DeiT (Touvron et al., 2021), and DINOv2 (Oquab et al., 2023). In our framework, we learn a prompt to draw the attention of the model to a specific point where the prompt is applied. The prompt is optimized for each vision encoder model specifically to generalize across different images. The depicted image is taken from COCO (Lin et al., 2014).
  • Figure 2: Overview of Self-supervised Prompt Optimization Framework: For a given image and a random position for the patch, the patch prior (random noise) is first passed through a neural prior (an auto-encoder neural network). A desired mask is then applied to the patch so that it only partially covers the image and avoids discarding image content. The prompt is positioned at the target location and, after the image passes through the frozen vision encoder, the attention weights are extracted. The desired attention target values are calculated with a Gaussian distribution centered at the token corresponding to the target location. During training, the framework learns a prompt that minimizes the Kullback-Leibler (KL) divergence loss between the attention values of the CLS token and the target distribution.
  • Figure 3: Effect of Patch Parameters: Performance of CLIP models on the CUB dataset for varying ratios of hollow circle and square masks over different patch sizes. The tile sizes of the ViT-Base and ViT-Large models are $32 \times 32$ and $14 \times 14$, respectively. Panel (e) shows the colors assigned to each ratio and illustrates $\lambda$.
  • Figure 4: Attention Gain with Prompt Usage throughout Layers. We applied our learned prompt to random locations on 1000 samples from the MS COCO dataset (blue line). For comparison, we also used a simple red circle prompt at the same locations (orange line). The relative gain is the difference between the attention values after and before applying the prompt, divided by the original attention values of the overlaid tokens without any prompt. Notably, while the simple red circle prompt is as effective as the learned prompt for CLIP-L/14 and CLIP-B/32, it is significantly less effective than the learned prompt for DeiT and DINOv2 at redirecting attention to a specific location.
  • Figure 5: Learned Prompt for Different Pretrained Vision Transformers. The optimal visual prompts differ across models, and each model has its own unique pattern. The prompt is optimized to generalize across images. Comparing the attention heatmaps of the original images (left) and the prompted images (right) reveals how effectively the prompt directs attention to specific locations. The prompt is optimized separately for each ViT (CLIP-B/32, CLIP-L/14, SigLIP, DeiT, and DINOv2), over 20k random samples from ImageNet (Deng et al., 2009). The depicted image is taken from MS COCO (Lin et al., 2014).
  • ...and 1 more figure
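
The relative gain described in the Figure 4 caption can be sketched as follows. This is an illustrative reading, not the authors' evaluation code: the function name, the boolean `token_mask` marking the tokens overlaid by the prompt, and the choice to sum attention over those tokens before taking the ratio are all assumptions.

```python
import numpy as np

def relative_gain(attn_before, attn_after, token_mask):
    """(after - before) / before, over the tokens covered by the prompt.

    attn_before, attn_after: attention maps on the token grid, without
        and with the prompt applied.
    token_mask: boolean array marking the overlaid tokens (hypothetical
        helper input; summing over the masked tokens is an assumption).
    """
    before = attn_before[token_mask].sum()
    after = attn_after[token_mask].sum()
    return float((after - before) / before)

# Toy usage on a 4x4 token grid with the prompt covering token (1, 2).
attn_before = np.full((4, 4), 1.0 / 16)
attn_after = attn_before.copy()
attn_after[1, 2] = 0.25  # the prompted token now receives more attention
token_mask = np.zeros((4, 4), dtype=bool)
token_mask[1, 2] = True
gain = relative_gain(attn_before, attn_after, token_mask)
```

A gain of zero means the prompt left the attention on the overlaid tokens unchanged, and larger positive values mean more attention was redirected to the prompt location.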