ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang

TL;DR

This work addresses the challenge of applying CLIP-style vision-language models to open-vocabulary semantic segmentation, where segmentation maps are often noisy because patch-level localization is poor. By decomposing the output of CLIP's vision encoder into a residual component $X_{\text{res}}$ and an attention component $X_{\text{attn}}$, the authors show that the residual path is the main source of noise and undermines local discriminability. They propose ClearCLIP, which makes three simple changes to the final layer: remove the residual connection, adopt self-self attention ($\text{Attn}_{qq}$), and discard the Feed-Forward Network (FFN). Together these changes boost the informative attention signal and yield clearer segmentation maps. Across eight benchmarks and multiple backbones, ClearCLIP yields consistent improvements over existing training-free and weakly supervised methods, highlighting the practical value of representation decomposition for dense vision-language inference.
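
The decomposition described above can be written out as a short sketch. Only $X_{\text{res}}$, $X_{\text{attn}}$, and $\text{Attn}_{qq}$ appear in this summary; the remaining symbols ($X_{\text{sum}}$, $X_{\text{out}}$, $\text{Proj}$, $\text{LN}$, $Q$, $K$, $V$, $d$) are assumed here for illustration, following the usual pre-norm transformer block, and may differ from the paper's notation.

$$
\begin{aligned}
\text{Vanilla final layer:}\quad & X_{\text{attn}} = \text{Proj}\big(\text{softmax}(QK^{\top}/\sqrt{d})\,V\big), \\
& X_{\text{sum}} = X_{\text{res}} + X_{\text{attn}}, \\
& X_{\text{out}} = X_{\text{sum}} + \text{FFN}\big(\text{LN}(X_{\text{sum}})\big), \\[4pt]
\text{ClearCLIP final layer:}\quad & \text{Attn}_{qq} = \text{softmax}(QQ^{\top}/\sqrt{d}), \\
& X_{\text{out}} = \text{Proj}\big(\text{Attn}_{qq}\,V\big).
\end{aligned}
$$

Under these assumptions, the ClearCLIP output keeps only the attention term: $X_{\text{res}}$ and the FFN no longer contribute, and keys are replaced by queries inside the softmax.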

Abstract

Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP and identify residual connections as the primary source of noise that degrades segmentation quality. With a comparative analysis of statistical properties in the residual connection and the attention output across different pretrained models, we discover that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. We introduce three simple modifications to the final layer: removing the residual connection, implementing self-self attention, and discarding the feed-forward network. ClearCLIP consistently generates clearer and more accurate segmentation maps and outperforms existing approaches across multiple benchmarks, affirming the significance of our discoveries.
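
The three final-layer modifications can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `last_block`, its arguments, the single attention head, and the layer norm without learned scale and shift are all simplifying assumptions made for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Layer norm without learned affine parameters (a simplification).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def last_block(x, w_q, w_k, w_v, w_o, ffn, clear=False):
    """Hypothetical single-head version of the final transformer block.

    x:    (num_tokens, dim) input tokens, i.e. the residual stream X_res.
    w_*:  (dim, dim) projection matrices (assumed names).
    ffn:  callable applied to the normalized sum in the vanilla path.
    clear=False -> standard pre-norm block (residual + q-k attention + FFN).
    clear=True  -> ClearCLIP-style output: attention term only, q-q attention.
    """
    h = layer_norm(x)
    q, v = h @ w_q, h @ w_v
    scale = q.shape[-1] ** -0.5

    if clear:
        # Self-self attention: queries attend to queries; keys are unused.
        attn_qq = softmax(q @ q.T * scale)
        # No residual connection and no FFN: return X_attn alone.
        return (attn_qq @ v) @ w_o

    k = h @ w_k
    attn_qk = softmax(q @ k.T * scale)
    x_attn = (attn_qk @ v) @ w_o
    x_sum = x + x_attn                       # residual connection
    return x_sum + ffn(layer_norm(x_sum))    # feed-forward network
```

Calling `last_block(..., clear=True)` on the patch tokens gives dense features that would then be compared against text embeddings for segmentation; in this sketch the result depends on neither `w_k` nor `ffn`, which mirrors the removal of the key projection's role, the residual path, and the FFN.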

Paper Structure (31 sections, 4 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Left: Example of open-vocabulary semantic segmentation. CLIP (Radford et al., 2021) fails to localize the object. MaskCLIP (Zhou et al., 2022) can localize the foreground and background but still exhibits significant noise. Our proposed method, ClearCLIP, produces a high-quality segmentation map. Our key insight is that vanilla CLIP's segmentation map can be decomposed into a cluttered map from the residual connection and a clearer, smoother map from the attention output of the last transformer layer. Right: Comparison of open-vocabulary semantic segmentation performance.
  • Figure 2: Comparison of norms and mIoU of different attention mechanisms for CLIP-B/16 (left) and CLIP-L/14 (right). The norm curve of $X_{\text{attn}}$ shows a positive correlation with the mIoU curve. A larger norm of $X_{\text{res}}$ in CLIP-L/14 limits the performance gains obtainable by revising the attention mechanism.
  • Figure 3: Open-vocabulary semantic segmentation using different feature maps of CLIP-B/16 model on the COCOStuff dataset. A visualization of an example (left) and quantitative results (right).
  • Figure 4: Statistics of three feature maps for DINO-B/16 and CLIP-B/16.
  • Figure 5: Ablation study on different architectures and different attention mechanisms.
  • ...and 6 more figures