SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

Feng Wang, Jieru Mei, Alan Yuille

TL;DR

This work replaces the traditional self-attention block in the last layer of CLIP's vision encoder with the authors' Correlative Self-Attention (CSA) module and reuses the pretrained query, key, and value projection matrices, yielding a training-free adaptation of CLIP for zero-shot semantic segmentation.

Abstract

Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong capabilities in zero-shot classification by aligning visual representations with target text embeddings at the image level. However, in dense prediction tasks, CLIP often struggles to localize visual features within an image and fails to give accurate pixel-level predictions, which prevents it from functioning as a generalized visual foundation model. In this work, we aim to enhance CLIP's potential for semantic segmentation with minimal modifications to its pretrained models. By rethinking self-attention, we surprisingly find that CLIP can adapt to dense prediction tasks by simply introducing a novel Correlative Self-Attention (CSA) mechanism. Specifically, we replace the traditional self-attention block in the last layer of CLIP's vision encoder with our CSA module and reuse its pretrained projection matrices of query, key, and value, leading to a training-free adaptation approach for CLIP's zero-shot semantic segmentation. Extensive experiments show the advantage of CSA: we obtain a 38.2% average zero-shot mIoU across the eight semantic segmentation benchmarks highlighted in this paper, significantly outperforming the existing SoTA's 33.9% and vanilla CLIP's 14.1%.
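
To make the mechanism described above more concrete, here is a minimal single-head NumPy sketch contrasting standard self-attention with a correlative variant. The abstract only states that attention is driven by correlations between local tokens and that the pretrained query, key, and value projections are reused. The specific form below (a query-query softmax map added to a key-key softmax map), the 1/sqrt(d) scaling, and the function names are assumptions made for illustration and are not confirmed by this summary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_self_attention(x, w_q, w_k, w_v):
    # x: (num_tokens, dim); w_*: (dim, dim) pretrained projections.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scale = np.sqrt(q.shape[-1])
    attn = softmax(q @ k.T / scale)  # query-key similarity
    return attn @ v, attn

def correlative_self_attention(x, w_q, w_k, w_v):
    # Same projections, no new parameters: scores come from token-to-token
    # correlations within one projected space (q with q, k with k).
    # ASSUMPTION: summing the two softmax maps is an illustrative choice.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scale = np.sqrt(q.shape[-1])
    attn = softmax(q @ q.T / scale) + softmax(k @ k.T / scale)
    return attn @ v, attn  # each attn row sums to 2 in this sketch

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # 5 hypothetical local tokens, dim 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = correlative_self_attention(tokens, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Because both functions take the same three matrices, swapping one for the other in the final layer requires no retraining, which is the sense in which the adaptation is training-free.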

Paper Structure (11 sections, 3 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Open-vocabulary semantic segmentation examples. We evaluate on two images from COCO (the 3rd and the 5th examples) and three high-resolution images in the wild, where our SCLIP consistently generates high-quality segmentation masks yet the original CLIP fails to correctly localize objects. We display the corresponding text query of each segmentation mask, where "g. retriever" and "b. collie" in the first example denote golden retriever and border collie, respectively.
  • Figure 2: Final layer attention maps of vanilla CLIP with a ViT-Base/16 image encoder. We display the attention maps of four points (marked in different colors) for each example. It shows that each local visual token attends to a wide range of positions and the attention maps often share similar patterns, indicating that CLIP learns spatial-invariant visual features.
  • Figure 3: An architectural comparison between the original self-attention and our correlative self-attention mechanism. Our method determines attention scores by pairwise correlations between the local tokens.
  • Figure 4: Comparison of attention maps. We show the attention maps of the last transformer layer in the CLIP vision encoder equipped with the original self-attention (right) and our correlative self-attention (left). Our correlative self-attention exhibits spatially covariant patterns, as the attention maps are distinct for different source points and show clear boundaries of semantic objects (e.g., the chair and the cat).
  • Figure 5: Additional visualization results on PASCAL VOC. "GT" denotes ground truth.
  • ...and 1 more figures
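
The segmentation masks with text queries shown in the figures rest on the idea of matching local visual features against text embeddings. A minimal sketch of that matching step follows, assuming per-patch features and per-class text embeddings are already available in a shared space. The function name `label_patches`, the cosine-similarity scoring, and the argmax assignment are illustrative assumptions, not the paper's confirmed pipeline.

```python
import numpy as np

def label_patches(patch_feats, text_embeds, grid_hw):
    # patch_feats: (num_patches, dim) local visual features (hypothetical input).
    # text_embeds: (num_classes, dim) one embedding per text query.
    # grid_hw: (rows, cols) of the patch grid, rows * cols == num_patches.
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    sims = p @ t.T                 # cosine similarity, (num_patches, num_classes)
    labels = sims.argmax(axis=-1)  # best-matching text query per patch
    return labels.reshape(grid_hw)

# Toy example: 3 classes in a 3-dimensional space, 2x2 patch grid.
text = np.eye(3)
patches = 2.0 * text[[0, 2, 1, 1]]
print(label_patches(patches, text, (2, 2)))
```

The resulting grid is a coarse, patch-level label map; producing a pixel-level mask would require upsampling it to the image resolution, a step not shown here.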