Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
Tong Shao, Zhuotao Tian, Hang Zhao, Jingyong Su
TL;DR
This paper analyzes why CLIP, despite strong zero-shot capabilities, struggles with dense open-vocabulary segmentation: it is trained for image-level alignment, and "global" patches emerge that weaken correlations among local patches. It introduces CLIPtrase, a training-free framework composed of Semantic Correlation Recovery, Patch Clustering, and Denoising, which restores patch-level semantic relationships and produces region-wise masks without additional training. Experiments across 9 benchmarks show significant improvements over CLIP and other training-free methods, and the method comes close to certain trainable baselines. It can also be combined with SAM to refine boundaries, which makes it practical for scalable zero-shot segmentation in real-world settings, although a gap remains to the state-of-the-art trainable models.
Abstract
CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite its success, its application to OVSS faces challenges due to its initial image-level alignment training, which affects its performance in tasks requiring detailed local context. Our study delves into the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach demonstrates notable improvements in segmentation accuracy and the ability to maintain semantic coherence across objects. Experiments show that we are 22.3% ahead of CLIP on average on 9 segmentation benchmarks, outperforming existing state-of-the-art training-free methods. The code is publicly available at: https://github.com/leaves162/CLIPtrase.
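
To make the idea of patch self-correlation more concrete, here is a minimal sketch in Python with NumPy. It is not the CLIPtrase implementation; the function names (`patch_self_correlation`, `segment_patches`), the `temperature` parameter, and the use of a row-wise softmax over cosine similarities are all assumptions chosen for illustration. The sketch takes patch features and text features that are assumed to already live in a shared embedding space, lets each patch aggregate features from the patches it is most similar to, and then assigns each patch the class of the closest text embedding.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def patch_self_correlation(patch_feats, temperature=0.1):
    """Row-wise softmax over cosine similarities between patches.

    patch_feats: (N, D) array of patch features (hypothetical input).
    Returns an (N, N) matrix whose rows sum to 1.
    """
    f = l2_normalize(patch_feats)
    sim = (f @ f.T) / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(sim)
    return weights / weights.sum(axis=1, keepdims=True)


def segment_patches(patch_feats, text_feats, temperature=0.1):
    """Assign each patch to the most similar text (class) embedding.

    Each patch feature is first replaced by a similarity-weighted
    average of all patch features, so that semantically related
    patches reinforce each other before classification.
    text_feats: (C, D) array of class text embeddings (hypothetical input).
    Returns an (N,) array of class indices.
    """
    attn = patch_self_correlation(patch_feats, temperature)
    smoothed = attn @ patch_feats
    logits = l2_normalize(smoothed) @ l2_normalize(text_feats).T
    return logits.argmax(axis=1)
```

In a real pipeline the patch features would come from CLIP's image encoder and the text features from its text encoder; the paper's clustering and denoising steps, which turn per-patch evidence into region-wise masks, are not reproduced in this sketch.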
