Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

Tong Shao, Zhuotao Tian, Hang Zhao, Jingyong Su

TL;DR

This paper analyzes why CLIP, despite strong zero-shot capabilities, struggles with dense open-vocabulary segmentation: its image-level alignment training gives rise to "global" patches that diminish correlations between local patches. It introduces CLIPtrase, a training-free framework composed of Semantic Correlation Recovery, Patch Clustering, and Denoising, which restores patch-level semantic relationships and produces region-wise masks, enabling open-vocabulary segmentation without additional training. Experiments across 9 benchmarks show significant improvements over CLIP and other training-free methods, with results that come close to certain trainable baselines. The method also integrates with SAM to further refine boundaries, which makes it practical for scalable, zero-shot segmentation in real-world settings, although a gap to the state-of-the-art trainable models remains.

Abstract

CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite its success, its application to OVSS faces challenges due to its initial image-level alignment training, which affects its performance in tasks requiring detailed local context. Our study delves into the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach demonstrates notable improvements in segmentation accuracy and the ability to maintain semantic coherence across objects. Experiments show that we are 22.3% ahead of CLIP on average on 9 segmentation benchmarks, outperforming existing state-of-the-art training-free methods. The code is publicly available at: https://github.com/leaves162/CLIPtrase.
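
To make the phrase "self-correlation among patches" more concrete, the following is a minimal sketch, not the authors' formulation. It assumes patch features arrive as a (num_patches, dim) array and builds attention-like weights from the similarity of those features with themselves, so that each patch attends most strongly to itself and to similar patches. The function name `patch_self_correlation` and the `temperature` parameter are hypothetical; the exact recalibration used by CLIPtrase is defined in the paper and the linked repository.

```python
import numpy as np


def patch_self_correlation(patch_feats: np.ndarray, temperature: float = 0.1) -> np.ndarray:
    """Row-normalized self-similarity of patch features (illustrative only).

    patch_feats: array of shape (num_patches, dim).
    temperature: hypothetical sharpening factor, not taken from the paper.
    """
    # L2-normalize each patch feature so dot products are cosine similarities.
    normed = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    similarity = normed @ normed.T  # (num_patches, num_patches)

    # Softmax over each row, with the usual max-subtraction for stability.
    logits = similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(196, 64))  # 14 x 14 patches, toy feature size
    w = patch_self_correlation(feats)
    print(w.shape)
```

Because every feature has cosine similarity 1 with itself, the largest weight in each row falls on the patch itself, which is the opposite of the behaviour described for "global" patches.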

Paper Structure

This paper contains 25 sections, 14 equations, 10 figures, and 8 tables.

Figures (10)

  • Figure 1: Comparison between our method and CLIP. (a): Performance comparison on open-vocabulary semantic segmentation of our model and CLIP. (b): Comparison of randomly selected patch attention response heatmaps between our model and CLIP. The red dot in the picture is the selected patch position.
  • Figure 2: Visualization of the "global" patch phenomenon in the attention maps of different layers of the ViT in the CLIP visual branch. The inter-patch attention map is the attention weight map between all patch features, of size 196×196. The [CLS] token attention map is the attention weight of the [CLS] token on all patch features, interpolated from 14×14 to 224×224 and displayed as a heat map. The 90th patch attention map is the attention weight of a randomly selected patch, displayed in the same way as the [CLS] token map; the red dot in the image marks the selected patch position. More visualizations are presented in the supplementary file.
  • Figure 3: Illustration of the key components of our CLIPtrase framework. The semantic correlation operation restores the semantic location between patches. While the restored $w$ continues forward to obtain CLIP visual features, we use clustering to obtain prototype attention weights of different categories and generate masks to improve classification results and refine object boundaries. All modules in the model are frozen to achieve the training-free setting.
  • Figure 4: CLIP attention map before and after semantic correlation recovery. The red dot indicates the selected patch position. Our method significantly restores the correlation between adjacent or semantically similar patches.
  • Figure 5: Examples of the clustering and denoising process. Without denoising, the results clearly contain noise caused by global patches.
  • ...and 5 more figures
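
The heat-map construction described in the Figure 2 caption (attention weights over 196 patches, reshaped to 14×14 and interpolated to 224×224) can be sketched as follows. This is an illustration under stated assumptions: the attention matrix is taken to include the [CLS] token at index 0, followed by the 196 patch tokens, and bilinear interpolation is used because the caption does not say which interpolation the authors applied. The function names `upsample_bilinear` and `cls_attention_heatmap` are hypothetical.

```python
import numpy as np


def upsample_bilinear(grid: np.ndarray, out_size: int) -> np.ndarray:
    """Bilinearly resize a square 2-D grid to (out_size, out_size)."""
    n = grid.shape[0]
    # Map output pixel centers back to input coordinates (half-pixel convention),
    # clamping at the borders.
    coords = (np.arange(out_size) + 0.5) * n / out_size - 0.5
    coords = np.clip(coords, 0, n - 1)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = coords - lo
    # Interpolate along rows first, then along columns.
    rows = grid[lo, :] * (1 - frac)[:, None] + grid[hi, :] * frac[:, None]
    return rows[:, lo] * (1 - frac)[None, :] + rows[:, hi] * frac[None, :]


def cls_attention_heatmap(attn: np.ndarray, grid_size: int = 14, image_size: int = 224) -> np.ndarray:
    """Turn the [CLS] row of an attention matrix into an image-sized heat map.

    attn: (1 + grid_size**2, 1 + grid_size**2) attention weights.
          Assumption: token 0 is [CLS], the remaining tokens are patches
          in row-major order.
    """
    cls_to_patches = attn[0, 1:]                        # (196,)
    grid = cls_to_patches.reshape(grid_size, grid_size)  # (14, 14)
    return upsample_bilinear(grid, image_size)           # (224, 224)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attn = rng.random((197, 197))
    attn /= attn.sum(axis=1, keepdims=True)  # rows sum to 1, like softmax output
    print(cls_attention_heatmap(attn).shape)
```

The same function applies to the attention map of a single patch, such as the 90th patch in the caption, by selecting that patch's row instead of row 0.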