Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation

Jinxin Zhou, Jiachen Jiang, Zhihui Zhu

TL;DR

This work tackles the gap between image-level CLIP pre-training and dense open-vocabulary segmentation by analyzing visual discriminability at the layer, head, and token levels. It introduces LHT-CLIP, a training-free framework comprising Abnormal Token Replacement (ATR), Semantic-Spatial Reweighting (SSR), and Selective Head Enhancement (SHE), which together restore fine-grained visual detail while preserving semantic alignment. Across eight segmentation benchmarks, LHT-CLIP delivers state-of-the-art results and remains robust across backbones (ViT-B/16 and ViT-L/14), extending beyond CLIP to SigLIP. The method is plug-and-play and adds minimal overhead, making it well suited to real-world open-vocabulary segmentation tasks.

Abstract

Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image-text alignment at the expense of visual discriminability (e.g., the last 3 layers in ViT-B/16 and the last 8 layers in ViT-L/14), partly due to the emergence of abnormal tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT-B/16) displays consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation patterns compared to normal tokens. Based on these findings, we propose three complementary techniques (semantic-spatial reweighting, selective head enhancement, and abnormal token replacement) that restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Extensive experiments on 8 common semantic segmentation benchmarks demonstrate that LHT-CLIP achieves state-of-the-art performance across diverse scenarios, highlighting its effectiveness and practicality for real-world deployment.
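
To make the "image-level versus pixel-level" gap concrete, the following is a minimal sketch of the generic training-free recipe that this line of work starts from: each patch token is assigned the class whose text embedding has the highest cosine similarity. It is not the LHT-CLIP pipeline; the function names (`l2_normalize`, `segment_patches`) and the toy 2-D features are assumptions made for illustration, and real features would come from a CLIP vision and text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def segment_patches(patch_feats, text_embeds, grid_hw):
    """Label each patch with the index of its most similar text embedding.

    patch_feats: list of N feature vectors, one per patch token (row-major).
    text_embeds: list of C class text embeddings of the same dimension.
    grid_hw:     (height, width) of the patch grid, with height * width == N.
    Returns a height x width nested list of class indices.
    """
    h, w = grid_hw
    if len(patch_feats) != h * w:
        raise ValueError("number of patches must equal height * width")
    texts = [l2_normalize(t) for t in text_embeds]
    labels = []
    for feat in patch_feats:
        f = l2_normalize(feat)
        # Cosine similarity reduces to a dot product after normalization.
        sims = [sum(a * b for a, b in zip(f, t)) for t in texts]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return [labels[r * w:(r + 1) * w] for r in range(h)]


# Toy example (hypothetical values): a 2x2 patch grid and two classes.
patch_feats = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
label_map = segment_patches(patch_feats, text_embeds, (2, 2))
```

In this baseline the quality of `label_map` depends entirely on how well individual patch features separate the classes, which is the visual discriminability that the abstract argues is degraded in CLIP's final layers.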

Paper Structure

This paper contains 29 sections, 6 equations, 10 figures, 14 tables.

Figures (10)

  • Figure 1: Layer-wise analysis of visual discriminability (blue) and semantic alignment (orange) within the CLIP vision encoders across different datasets. The final layer is excluded from the analysis to avoid discrepancies caused by the modifications that different methods make to the last layer.
  • Figure 2: Abnormal token phenomenon in attention maps across different layers of the ViT-B/16 model used as the CLIP vision encoder. Attention maps are computed with respect to specific visual token positions, denoted by $\textcolor{red}{\boldsymbol{\times}}$ (e.g., the “child” token in the top row and the “car” token in the bottom row). Representative abnormal tokens are highlighted with orange boxes.
  • Figure 3: Illustration of the sparsity and high-norm characteristics of abnormal tokens. Panel (a) shows the attention map of the red anchor token $\textcolor{red}{\boldsymbol{\times}}$. Panels (b)–(d) depict the channel activations of a normal token (red $\textcolor{red}{\boldsymbol{\times}}$) and two abnormal tokens (orange $\textcolor{orange}{\boldsymbol{\star}}$ and blue $\textcolor{blue}{\boldsymbol{\star}}$) highlighted in panel (a). Panel (e) presents the Hoyer score distribution across layers and token positions.
  • Figure 4: Layer-wise cosine similarity among abnormal tokens across positions, layers and samples.
  • Figure 5: Head-wise visual discriminability analysis across multiple datasets using the ViT-B/16 backbone. The dashed lines in different colors denote the corresponding layer-wise visual discriminability scores. For clarity, heads from three layers (i.e., the 6th, 7th, and 8th layers) are displayed.
  • ...and 5 more figures
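
The Figure 3 caption refers to a Hoyer score as the sparsity measure for token activations. The paper's exact formula is not shown in this excerpt, so the sketch below assumes the standard Hoyer sparsity definition, $(\sqrt{n} - \lVert x \rVert_1 / \lVert x \rVert_2) / (\sqrt{n} - 1)$, which is 0 for a perfectly uniform vector and 1 for a vector with a single nonzero entry. The function name `hoyer_score` and the toy activation vectors are illustrative assumptions.

```python
import math


def hoyer_score(x):
    """Standard Hoyer sparsity of a vector, in [0, 1].

    Assumed definition (the excerpt does not give the paper's formula):
    (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1).
    Returns 0.0 for vectors with fewer than 2 entries or zero norm.
    """
    n = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    if n < 2 or l2 == 0:
        return 0.0
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


# Hypothetical channel activations for two tokens.
dense_token = [1.0] * 8                     # energy spread evenly over channels
sparse_token = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0]  # one dominant channel

dense_score = hoyer_score(dense_token)
sparse_score = hoyer_score(sparse_token)
```

Under this definition, a token whose activation is concentrated in a few high-magnitude channels, as the caption describes for abnormal tokens, scores close to 1, while a token with evenly spread activations scores close to 0.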