
Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, Yansong Tang

TL;DR

This work tackles open-vocabulary segmentation by addressing CLIP's weakness in capturing local details due to image-level pretraining. It introduces Self-Calibrated CLIP (SC-CLIP), a training-free calibration that first resolves anomaly tokens to prevent uniform attention, then exploits semantic coherence in mid-level CLIP features to self-adjust deep representations, and finally employs a two-pass, multi-level fusion to enrich detail without extra parameters or backbones. The method yields state-of-the-art results across eight datasets and backbones, with substantial gains over vanilla CLIP (up to 6.8× on ViT-L/14) and an average improvement of 9.5 percentage points over prior training-free approaches. By leveraging CLIP's own properties and internal feature structure, SC-CLIP demonstrates that dense, semantically coherent open-vocabulary segmentation can be achieved without additional training or components, offering a practical, efficient solution for dense cross-modal understanding.

Abstract

Recent advancements in pre-trained vision-language models like CLIP have enabled the task of open-vocabulary segmentation. CLIP demonstrates impressive zero-shot capabilities in various downstream tasks that require holistic image understanding. However, due to its image-level pre-training, CLIP struggles to capture local details, resulting in poor performance on segmentation tasks. Our analysis reveals that anomaly tokens emerge during the forward pass and draw excessive attention from normal patch tokens, thereby diminishing spatial awareness. To address this issue, we propose Self-Calibrated CLIP (SC-CLIP), a training-free method that calibrates CLIP to produce finer representations while preserving its original generalization ability, without introducing new parameters or relying on additional backbones. Specifically, we first identify and resolve the anomaly tokens to mitigate their negative impact. Next, we enhance feature discriminability and attention correlation by leveraging the semantic consistency found in CLIP's intermediate features. Furthermore, we explore how to effectively employ multi-level feature fusion under the training-free setting. Collectively, these strategies enhance CLIP's feature representation with greater granularity and coherence. Experimental results demonstrate the effectiveness of SC-CLIP, which achieves state-of-the-art results across all datasets and surpasses previous methods by 9.5%. Notably, SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times. Our source code is available at https://github.com/SuleBai/SC-CLIP.
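
The idea of leveraging the semantic consistency of intermediate features can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `cosine` and `self_adjust`, the `threshold` parameter, and the choice of clipped cosine similarity with sum normalization are all assumptions made for illustration. The sketch computes patch-to-patch similarities from mid-layer features and uses them as weights to aggregate deep-layer features, so that semantically similar patches are pulled together.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def self_adjust(mid_feats, deep_feats, threshold=0.0):
    """Aggregate deep features using mid-layer patch similarities.

    mid_feats, deep_feats: lists of N patch vectors (same patch order).
    threshold: hypothetical cutoff; similarities below it get zero weight.
    Returns a new list of N aggregated deep vectors.
    """
    n = len(mid_feats)
    adjusted = []
    for i in range(n):
        # Assumed weighting: clipped cosine similarity to every patch.
        weights = [
            max(cosine(mid_feats[i], mid_feats[j]) - threshold, 0.0)
            for j in range(n)
        ]
        total = sum(weights)
        if total == 0.0:
            # No similar patch found: keep the original deep feature.
            adjusted.append(list(deep_feats[i]))
            continue
        dim = len(deep_feats[i])
        adjusted.append([
            sum(w * deep_feats[j][d] for j, w in enumerate(weights)) / total
            for d in range(dim)
        ])
    return adjusted
```

In this sketch, two patches with identical mid-layer features end up sharing the average of their deep features, while a patch that is orthogonal to all others keeps its own deep feature unchanged.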

Paper Structure

This paper contains 20 sections, 4 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Left: Vanilla CLIP produces a noisy segmentation map, while our Self-Calibrated CLIP (SC-CLIP) generates a clearer and finer result. Right: Comparison of the open-vocabulary segmentation performance, where SC-CLIP achieves the best results.
  • Figure 2: Anomaly tokens in CLIP. (a) visualizes the attention maps of several selected patches (marked by $\bigstar$), all of which focus excessively on the same regions (indicated by the orange circle). These regions align with the outliers identified by the PCA analysis in (b).
  • Figure 3: Resolving the Anomaly Tokens. (a) Illustration of the resolving process. We plot the feature map using the mean value of each token. After locating the anomaly tokens (the center of the red square), we replace them with the interpolated values obtained from their neighboring regions. (b) Effect on the attention map. We highlight the changes for a normal token ($\bigstar$) and an anomaly token ($\blacktriangle$).
  • Figure 4: Top: Visualization of patch similarities shows CLIP's deep layers perform poorly, but its mid layers exhibit semantic consistency comparable to DINO. Bottom Left: ROC curve analysis further supports this, with SC-CLIP showing superior semantic coherence. Bottom Right: Detailed ROC analysis of our method.
  • Figure 5: Illustration of the self-adjusting strategy. (a) We use the similarity map from CLIP's mid layer to adaptively aggregate deep features by combining semantically similar patches, resulting in a clearer segmentation map. The second row provides a detailed view of the process for the selected patch $\bigstar$. (b) We apply the similarity map to enhance attention, broadening and refining the activation regions.
  • ...and 1 more figure
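
The resolving process described in the Figure 3 caption (locate the anomaly tokens, then replace them with values interpolated from their neighbors) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the detection rule (deviation of a token's mean value from the grid median, scaled by the median absolute deviation), the factor `k`, the 3×3 neighborhood, and the function names `find_anomalies` and `resolve_anomalies` are all hypothetical.

```python
from statistics import median


def find_anomalies(feats, k=5.0):
    """Flag tokens whose mean value is an outlier on the patch grid.

    feats: H x W grid (nested lists) of token feature vectors.
    k: hypothetical sensitivity factor (assumption, not from the paper).
    Returns a set of (row, col) positions.
    """
    h, w = len(feats), len(feats[0])
    means = [[sum(feats[i][j]) / len(feats[i][j]) for j in range(w)]
             for i in range(h)]
    flat = [m for row in means for m in row]
    med = median(flat)
    # Median absolute deviation; fall back to a tiny value if it is zero.
    mad = median(abs(m - med) for m in flat) or 1e-8
    return {(i, j) for i in range(h) for j in range(w)
            if abs(means[i][j] - med) > k * mad}


def resolve_anomalies(feats, anomalies):
    """Replace each anomaly token with the mean of its normal 3x3 neighbors."""
    h, w = len(feats), len(feats[0])
    out = [[list(v) for v in row] for row in feats]  # copy, input untouched
    for (i, j) in anomalies:
        neighbors = [
            feats[a][b]
            for a in range(max(0, i - 1), min(h, i + 2))
            for b in range(max(0, j - 1), min(w, j + 2))
            if (a, b) != (i, j) and (a, b) not in anomalies
        ]
        if neighbors:  # leave the token as-is if every neighbor is anomalous
            dim = len(feats[i][j])
            out[i][j] = [sum(v[d] for v in neighbors) / len(neighbors)
                         for d in range(dim)]
    return out
```

A typical call would be `resolve_anomalies(grid, find_anomalies(grid))`, where `grid` holds the patch tokens of one layer arranged on their spatial grid. Excluding other anomalous tokens from the neighborhood average is a design choice of this sketch, intended to keep one outlier from contaminating the replacement of another.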