FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance

Jiedong Zhuang, Jiaqi Hu, Lianrui Mu, Rui Hu, Xiaoyu Liang, Jiangnan Ye, Haoji Hu

TL;DR

This work tackles the limitation of content-altering visual prompts in CLIP by proposing FALIP, a train-free foveal-attention mechanism that injects region-guided masks directly into CLIP's self-attention without modifying the input image. By aligning regions of attention with foveal masks, FALIP enhances zero-shot performance across referring expression comprehension, image classification, and 3D point cloud recognition, while remaining plug-and-play and computationally light. The authors provide extensive analyses showing how visual prompts influence attention, and demonstrate that selectively unleashing certain attention heads can further boost gains beyond the baseline prompts. Overall, FALIP offers a practical and scalable way to harness the benefits of visual prompts through attention modulation, with strong empirical results and insights into the role of head-specific responsiveness in CLIP.

Abstract

CLIP has achieved impressive zero-shot performance after pre-training on a large-scale dataset consisting of paired image-text data. Previous works have utilized CLIP by incorporating manually designed visual prompts like colored circles and blur masks into the images to guide the model's attention, showing enhanced zero-shot performance in downstream tasks. Although these methods have achieved promising results, they inevitably alter the original information of the images, which can lead to failure in specific tasks. We propose a train-free method, Foveal-Attention CLIP (FALIP), which adjusts CLIP's attention by inserting foveal attention masks into the multi-head self-attention module. We demonstrate that FALIP effectively boosts CLIP zero-shot performance in tasks such as referring expression comprehension, image classification, and 3D point cloud recognition. Experimental results further show that FALIP outperforms existing methods on most metrics and can augment current methods to enhance their performance.
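
The core idea in the abstract, steering attention with a mask instead of editing pixels, can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian shape of the bias, the function names (`gaussian_region_bias`, `cls_attention`), the `peak` parameter, and the restriction to a single query row over a patch grid are all assumptions made for illustration. The only point carried over from the text is that a region-dependent term is added inside self-attention while the input image stays untouched.

```python
import math

def gaussian_region_bias(grid_h, grid_w, box, peak=1.0):
    """Hypothetical foveal bias over a patch grid (assumed Gaussian shape).

    box = (row0, col0, row1, col1) in inclusive patch coordinates.
    The bias is largest at the box centre and decays smoothly outward.
    """
    r0, c0, r1, c1 = box
    cy, cx = (r0 + r1) / 2.0, (c0 + c1) / 2.0
    sy = max((r1 - r0 + 1) / 2.0, 0.5)  # spread tied to box height
    sx = max((c1 - c0 + 1) / 2.0, 0.5)  # spread tied to box width
    bias = []
    for r in range(grid_h):
        for c in range(grid_w):
            d2 = ((r - cy) / sy) ** 2 + ((c - cx) / sx) ** 2
            bias.append(peak * math.exp(-0.5 * d2))
    return bias  # row-major, length grid_h * grid_w

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cls_attention(logits, bias=None):
    """Attention weights of one query over all patches.

    logits stands in for the scaled query-key scores of that query;
    bias, if given, is added to the logits before the softmax.
    """
    if bias is not None:
        logits = [l + b for l, b in zip(logits, bias)]
    return softmax(logits)

if __name__ == "__main__":
    # 4x4 patch grid with flat logits; region of interest is the central 2x2 block.
    flat_logits = [0.0] * 16
    bias = gaussian_region_bias(4, 4, (1, 1, 2, 2))
    inside = [5, 6, 9, 10]  # row-major indices of the central block
    plain = cls_attention(flat_logits)
    foveal = cls_attention(flat_logits, bias)
    print("mass inside region, no mask:  ", sum(plain[i] for i in inside))
    print("mass inside region, with mask:", sum(foveal[i] for i in inside))
```

Because the bias enters only the attention logits, the patches themselves are never modified; this is the property that separates this style of prompting from drawing circles or blurring the background.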

Paper Structure (23 sections, 6 equations, 15 figures, 14 tables, 1 algorithm)

This paper contains 23 sections, 6 equations, 15 figures, 14 tables, 1 algorithm.

Figures (15)

  • Figure 1: Overview of visual-prompt-based methods and FALIP. Left: the visual prompt methods (CPT, ReCLIP, RedCircle). They perform image editing (such as colored boxes, cropping, circles, blur masks, etc.), enabling CLIP to perceive specific regions. Bottom right: FALIP, which does not alter the content of the original image. The gray dashed line represents the model's attention. Compared to the original CLIP, FALIP aligns more closely with human visual characteristics.
  • Figure 2: The shift in the model's attention before and after incorporating visual prompts. It can be observed that visual prompts can guide the model's attention to specific regions.
  • Figure 3: FALIP Overview. We first input the image into the foveal attention generation module to obtain a foveal attention mask. Then, we input original images to the CLIP image encoder, while also providing the foveal attention mask to the Multi-head Self-Attention (MSA) module. With different input images and text prompts, the model can accomplish tasks such as referring expression comprehension, image classification and 3D point cloud recognition.
  • Figure 4: Visualization of referring expression comprehension. The model predicts the corresponding object in the image based on the given referring expression. The keywords in the referring expression are colored orange.
  • Figure 5: Pipeline of 3D point cloud recognition. Left: The overall framework remains consistent with PointCLIP, with the difference being the insertion of foveal attention in the CLIP image encoder. Right: Attention on the 2D depth maps for the original CLIP and our method. It can be observed that our method attends more strongly to the foreground.
  • ...and 10 more figures