Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation

Jingyun Wang, Cilin Yan, Guoliang Kang

TL;DR

This work tackles training-free open-vocabulary semantic segmentation by harnessing CLIP’s global knowledge, which prior TF-OVSS methods often discard to emphasize locality. It introduces GCLIP, a framework with two modules: Attention Map Fusion (AMF) to inject image-level global properties into the last-block attention by fusing global-token emerging block attentions with the final block, and Channel Suppression (CS) to enforce semantic coherence among Value embeddings via targeted re-normalization of a problematic FFN channel. Empirically, GCLIP achieves state-of-the-art results on five benchmarks (e.g., Cityscapes +3.7% mIoU over ClearCLIP) and demonstrates robustness across multiple pre-trained VLM backbones, with ablations confirming the contributions of AMF and CS. The study shows that CLIP’s global knowledge can be effectively mined and leveraged for dense prediction without additional training, enabling stronger generalization to unseen categories in open-vocabulary segmentation.

Abstract

Recent works modify CLIP to perform open-vocabulary semantic segmentation in a training-free manner (TF-OVSS). In vanilla CLIP, patch-wise image representations mainly encode homogeneous image-level properties, which hinders the application of CLIP to dense prediction tasks. Previous TF-OVSS works sacrifice globality to enhance the locality of CLIP features by making each patch mainly attend to itself or to its neighboring patches within a narrow local window. With these modifications, the ability of CLIP to aggregate global context information is largely weakened. In contrast, in this paper we rethink the global knowledge encoded by CLIP and propose GCLIP to answer how to extract and utilize the beneficial global knowledge of CLIP for TF-OVSS. As the representation of each patch is ultimately determined by the attention weights and the Value embeddings, we propose to reshape the last-block attention and Value embeddings to aggregate useful global context into the final features. Firstly, we aim to equip the last-block attention with image-level properties without introducing homogeneous attention patterns across patches. To this end, we fuse the attention from the global-token emerging blocks with the Query-Query attention. Secondly, we aim to make the Value embeddings of the last-block attention module more semantically correlated. To this end, we design a novel channel suppression strategy. Extensive experiments on five standard benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods.
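The attention-reshaping step described in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible fusion rule, a convex mix of the Query-Query attention with the mean of the earlier-block attention maps; the abstract does not state the actual fusion operator, so `fuse_attention`, its `weight` parameter, and all toy numbers below are hypothetical and not the authors' implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(score - peak) for score in row]
    total = sum(exps)
    return [e / total for e in exps]

def query_query_attention(queries):
    """Attention where each patch's Query is scored against every patch's Query."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q_i, q_j)) for q_j in queries])
        for q_i in queries
    ]

def fuse_attention(qq_attn, global_attns, weight=0.5):
    """ASSUMED fusion rule (not from the paper): convex mix of the Q-Q attention
    with the mean of the attention maps from the global-token emerging blocks."""
    n = len(qq_attn)
    count = len(global_attns)
    fused = []
    for i in range(n):
        row = []
        for j in range(n):
            mean_global = sum(attn[i][j] for attn in global_attns) / count
            row.append((1.0 - weight) * qq_attn[i][j] + weight * mean_global)
        fused.append(row)
    return fused

def aggregate(attn, values):
    """Final patch features: attention-weighted sum of the Value embeddings."""
    dim = len(values[0])
    return [
        [sum(attn[i][j] * values[j][d] for j in range(len(values))) for d in range(dim)]
        for i in range(len(attn))
    ]

# Toy example: 3 patches with 2-dimensional Queries and Values (made-up numbers).
queries = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# One earlier-block attention map in which patch 2 behaves like a global token.
global_attns = [[[0.2, 0.1, 0.7], [0.1, 0.2, 0.7], [0.1, 0.1, 0.8]]]

qq = query_query_attention(queries)
fused = fuse_attention(qq, global_attns, weight=0.5)
features = aggregate(fused, values)
```

Because both inputs are row-stochastic, a convex mix keeps every fused row summing to one, so the result can be used directly as attention weights over the Value embeddings. Setting `weight=0.0` recovers plain Query-Query attention.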

Paper Structure

This paper contains 13 sections, 9 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Experiments with CLIP ViT-B/16. (a) Emergence of global tokens (best viewed in color). Global tokens (highlighted stripes in Line 1) start to emerge in the attention map of block 6. Comparing the attention maps after block 6, we observe that the attention pattern of global tokens aligns well with that of the [Cls] token (Lines 2 and 3). (b) Channel Suppression (CS). We observe that the entropy of weight norms decreases abnormally from block 7 in (2). By applying CS to the abnormal weight norm of the second fully-connected layer of the FFN in a Transformer block (see (3)), we enhance the semantic correlation: value embeddings of patches within the same semantic mask become more similar ("in-in"), while those from different masks become more dissimilar ("in-out").
  • Figure 2: Method Overview. (a) Overview. In this paper, we propose a new framework, GCLIP, consisting of Attention Map Fusion (AMF) and Channel Suppression (CS), for Training-Free Open-Vocabulary Semantic Segmentation. (b) Attention Map Fusion. We fuse the attentions of the early global-token emerging blocks ($L_g$, $L_{g+1}$, $\cdots$) with the Query-Query attention of the last block ($L_{f}$) to emphasize the effect of global knowledge. (c) Channel Suppression. We suppress the weight norm of the specific output channel $\hat{d}$ of the FFN with a re-normalizing operation $\varphi$, as defined in the paper's re-normalization equation, to enhance the semantic correlation of Value embeddings.
  • Figure 3: Weight Norms of the second fully-connected layer in FFNs. Starting from block 5 (CLIP ViT-B/16), we observe that the weight norm of the FFN's second fully-connected layer for one specific output channel becomes unexpectedly larger than the weight norms of the other channels.
  • Figure 4: Qualitative Results. We visualize the segmentation results of GCLIP on both PASCAL VOC and PASCAL Context. We observe that the masks generated by ClearCLIP often fail to cover the entire target object, because without sufficient global context the model may confuse semantically similar categories. GCLIP extracts semantically correlated patch-level image features by enhancing global context information. The masks generated by GCLIP clearly outperform those of both vanilla CLIP and ClearCLIP.
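The weight-norm observation and the suppression step in the captions of Figures 1(b), 2(c), and 3 can be sketched as follows. This is a hedged illustration only: the paper's operation $\varphi$ is not given in this section, so the rule used here (rescaling the largest-norm output channel to the mean norm of the remaining channels) is an assumption, as are the row-per-output-channel weight layout, the function names, and the toy weights.

```python
import math

def channel_norms(weight):
    """L2 norm of each output channel (ASSUMED layout: one row per output channel)."""
    return [math.sqrt(sum(w * w for w in row)) for row in weight]

def norm_entropy(norms):
    """Shannon entropy of the norms after normalizing them to sum to 1."""
    total = sum(norms)
    return -sum((n / total) * math.log(n / total) for n in norms if n > 0)

def suppress_channel(weight):
    """ASSUMED re-normalization (a stand-in for the paper's phi): rescale the
    largest-norm output channel so its norm equals the mean of the other norms."""
    norms = channel_norms(weight)
    d_hat = max(range(len(norms)), key=norms.__getitem__)
    others = [n for i, n in enumerate(norms) if i != d_hat]
    target = sum(others) / len(others)
    factor = target / norms[d_hat]
    suppressed = [list(row) for row in weight]  # copy; the input stays unchanged
    suppressed[d_hat] = [w * factor for w in weight[d_hat]]
    return suppressed, d_hat

# Toy second-FC-layer weights: 4 output channels, channel 2 has an outlier norm.
weight = [[1.0, 0.0], [0.0, 1.0], [6.0, 8.0], [0.6, 0.8]]

suppressed, d_hat = suppress_channel(weight)
entropy_before = norm_entropy(channel_norms(weight))
entropy_after = norm_entropy(channel_norms(suppressed))
```

In this toy setup the outlier channel dominates the norm distribution, so its entropy is low; after suppression the norms are nearly uniform and the entropy rises. That mirrors, in miniature, the abnormal entropy drop described for Figure 1(b).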