FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance
Jiedong Zhuang, Jiaqi Hu, Lianrui Mu, Rui Hu, Xiaoyu Liang, Jiangnan Ye, Haoji Hu
TL;DR
This work addresses a limitation of existing visual prompts for CLIP, which alter image content, by proposing FALIP, a train-free foveal-attention mechanism that inserts region-guided masks directly into CLIP's self-attention without modifying the input image. By steering attention toward the regions marked by the foveal masks, FALIP improves zero-shot performance on referring expression comprehension, image classification, and 3D point cloud recognition, while remaining plug-and-play and computationally light. The authors also analyze how visual prompts influence attention, and demonstrate that selectively unleashing certain attention heads can yield gains beyond those of the baseline prompts. Overall, FALIP offers a practical and scalable way to obtain the benefits of visual prompts through attention modulation, with strong empirical results and insights into how individual attention heads in CLIP respond to prompts.
Abstract
CLIP has achieved impressive zero-shot performance after pre-training on a large-scale dataset of paired image-text data. Previous works have utilized CLIP by incorporating manually designed visual prompts, such as colored circles and blur masks, into the images to guide the model's attention, showing enhanced zero-shot performance in downstream tasks. Although these methods have achieved promising results, they inevitably alter the original information of the images, which can lead to failure in specific tasks. We propose a train-free method, Foveal-Attention CLIP (FALIP), which adjusts CLIP's attention by inserting foveal attention masks into the multi-head self-attention module. We demonstrate that FALIP effectively boosts CLIP's zero-shot performance in tasks such as referring expression comprehension, image classification, and 3D point cloud recognition. Experimental results further show that FALIP outperforms existing methods on most metrics and can be combined with current methods to enhance their performance.
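To make the idea of inserting a mask into self-attention concrete, here is a minimal NumPy sketch of one plausible reading: an additive bias on the [CLS] query's attention logits, applied before the softmax. It is an illustration under stated assumptions, not the authors' implementation. The function names (foveal_mask, attention_with_mask), the Gaussian falloff, the scale factor of 2.0, the single attention head, and the choice to bias only the [CLS] row are all hypothetical; the abstract does not specify them.

```python
import numpy as np


def foveal_mask(grid_h, grid_w, box, sigma=1.0):
    """Gaussian-shaped bias over a patch grid (hypothetical mask shape).

    box = (row0, col0, row1, col1) in patch coordinates, inclusive.
    Returns values in (0, 1], equal to 1 inside the box.
    """
    rows = np.arange(grid_h)[:, None]
    cols = np.arange(grid_w)[None, :]
    # Per-axis distance from each patch to the box (zero inside the box).
    dr = np.maximum(np.maximum(box[0] - rows, rows - box[2]), 0)
    dc = np.maximum(np.maximum(box[1] - cols, cols - box[3]), 0)
    dist2 = dr**2 + dc**2
    return np.exp(-dist2 / (2.0 * sigma**2))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_mask(q, k, v, cls_bias=None):
    """Single-head attention over tokens [CLS, patch_1, ..., patch_N].

    cls_bias has shape (N,) and is added to the [CLS] query's logits over
    the patch keys (an assumption of this sketch). The image is untouched.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)  # shape (N + 1, N + 1)
    if cls_bias is not None:
        logits = logits.copy()
        logits[0, 1:] += cls_bias
    weights = softmax(logits, axis=-1)
    return weights @ v, weights


# Toy example: random features stand in for a real CLIP layer.
rng = np.random.default_rng(0)
grid_h = grid_w = 4
n_tokens = grid_h * grid_w + 1
dim = 8
q, k, v = (rng.standard_normal((n_tokens, dim)) for _ in range(3))

mask = foveal_mask(grid_h, grid_w, box=(1, 1, 2, 2), sigma=1.0)
_, w_plain = attention_with_mask(q, k, v)
_, w_fov = attention_with_mask(q, k, v, cls_bias=2.0 * mask.ravel())

# Share of [CLS] attention that falls on patches inside the box.
in_box = mask.ravel() == 1.0
mass_plain = w_plain[0, 1:][in_box].sum()
mass_fov = w_fov[0, 1:][in_box].sum()
print(f"[CLS] attention inside region: {mass_plain:.3f} -> {mass_fov:.3f}")
```

Because the bias is largest inside the box, the [CLS] token's attention mass on the region can only increase relative to the unbiased case, while the attention rows of the patch tokens, and the input pixels, are left unchanged. This is the property that distinguishes attention-level guidance from prompts drawn onto the image.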
