Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation

Jinxin Zhou; Jiachen Jiang; Zhihui Zhu

TL;DR

This work tackles the gap between image-level CLIP pretraining and dense open-vocabulary segmentation by analyzing visual discriminability at the layer, head, and token levels. It introduces LHT-CLIP, a training-free framework comprising Semantic-Spatial Reweighting (SSR), Selective Head Enhancement (SHE), and Abnormal Token Replacement (ATR) to restore fine-grained visual detail while preserving semantic alignment. Across eight segmentation benchmarks, LHT-CLIP delivers state-of-the-art results and remains robust across backbones (ViT-B/16 and ViT-L/14) and beyond CLIP to SigLIP. The method is plug-and-play with minimal overhead, making it well suited to real-world open-vocabulary segmentation tasks.

Abstract

Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image-text alignment at the expense of visual discriminability (e.g., the last 3 layers in ViT-B/16 and the last 8 layers in ViT-L/14), partly due to the emergence of anomalous tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT-B/16) display consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation patterns compared to normal tokens. Based on these findings, we propose three complementary techniques: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement, which effectively restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Extensive experiments on 8 common semantic segmentation benchmarks demonstrate that LHT-CLIP achieves state-of-the-art performance across diverse scenarios, highlighting its effectiveness and practicality for real-world deployment.
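To make insight (iii) concrete, the following is a minimal sketch of what token replacement based on activation sparsity could look like. It is not the paper's ATR procedure: the Hoyer sparsity score, the `threshold` value, the 3x3 neighbor averaging, and the function names `hoyer_sparsity` and `replace_abnormal_tokens` are all illustrative assumptions, and the patch-token grid is synthetic rather than taken from CLIP.

```python
import numpy as np


def hoyer_sparsity(x, eps=1e-12):
    """Hoyer sparsity along the last axis: about 0 for a uniform vector, about 1 for a one-hot vector.

    Assumption: this score stands in for whatever sparsity criterion the paper uses.
    """
    d = x.shape[-1]
    l1 = np.abs(x).sum(axis=-1)
    l2 = np.sqrt((x ** 2).sum(axis=-1)) + eps
    return (np.sqrt(d) - l1 / l2) / (np.sqrt(d) - 1)


def replace_abnormal_tokens(feats, threshold=0.8):
    """Flag tokens whose activations are unusually sparse and overwrite them.

    feats: array of shape (H, W, D), a grid of patch-token features.
    threshold: hypothetical cut-off on the sparsity score (an assumption).
    Returns (cleaned_feats, abnormal_mask); the input array is not modified.
    """
    h, w, _ = feats.shape
    mask = hoyer_sparsity(feats) > threshold
    out = feats.copy()
    for i, j in zip(*np.nonzero(mask)):
        # 3x3 window around the flagged token, clipped at the grid border.
        i0, i1 = max(i - 1, 0), min(i + 2, h)
        j0, j1 = max(j - 1, 0), min(j + 2, w)
        window = feats[i0:i1, j0:j1]
        normal = ~mask[i0:i1, j0:j1]
        # Replace with the mean of the unflagged tokens in the window, if any.
        if normal.any():
            out[i, j] = window[normal].mean(axis=0)
    return out, mask


# Synthetic example: a 4x4 grid of dense tokens with one sparse outlier.
feats = np.ones((4, 4, 8))
feats[1, 2] = 0.0
feats[1, 2, 3] = 50.0  # a single large activation, i.e. a one-hot-like token
cleaned, mask = replace_abnormal_tokens(feats)
```

In this toy grid only the token at position (1, 2) is flagged, and it is overwritten with the average of its eight dense neighbors. On real CLIP features the threshold would need to be chosen by inspecting the score distribution.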
