CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

Lin Sun, Jiale Cao, Jin Xie, Xiaoheng Jiang, Yanwei Pang

TL;DR

This paper presents CLIPer, a hierarchical framework that improves the spatial representation of CLIP for open-vocabulary semantic segmentation and achieves state-of-the-art performance on seven segmentation datasets.

Abstract

Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on various image-level tasks, which has motivated research on adapting CLIP to pixel-level open-vocabulary semantic segmentation without additional training. The key is to improve the spatial representation of image-level CLIP, for example by replacing the self-attention map at the last layer with a self-self attention map or an attention map derived from a vision foundation model. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. CLIPer includes an early-layer fusion module and a fine-grained compensation module. We observe that the embeddings and attention maps at early layers preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate a segmentation map with better spatial coherence. Afterwards, we employ the fine-grained compensation module to recover local details using the self-attention maps of a diffusion model. We conduct experiments on seven segmentation datasets, and CLIPer achieves state-of-the-art performance on them. For instance, using ViT-L, CLIPer attains an mIoU of 69.8% on VOC and 43.3% on COCO Object, outperforming ProxyCLIP by 9.2% and 4.1%, respectively.
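The two stages named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract gives no formulas, so the fusion rule (a plain average of early-layer attention maps), the scoring rule (softmax over cosine similarities, with no temperature), and the compensation rule (multiplying the coarse map by a row-normalised diffusion attention map) are all assumptions chosen for illustration. The function names and toy inputs are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_maps(maps):
    """Element-wise mean of several equally sized attention maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) / len(maps) for j in range(cols)]
            for i in range(rows)]

def segment(early_attn_maps, patch_values, text_embeds, sd_attn):
    # Stage 1 (assumed form of early-layer fusion): average the early-layer
    # attention maps and use the result to aggregate the patch values.
    fused_attn = average_maps(early_attn_maps)
    patch_embeds = matmul(fused_attn, patch_values)
    # Coarse map: per-patch class probabilities from patch/text similarity.
    coarse = [softmax([cosine(p, t) for t in text_embeds]) for p in patch_embeds]
    # Stage 2 (assumed form of fine-grained compensation): propagate the
    # coarse scores along the diffusion model's self-attention map.
    refined = matmul(sd_attn, coarse)
    labels = [max(range(len(r)), key=r.__getitem__) for r in refined]
    return coarse, refined, labels

# Toy inputs: 3 patches, 2 classes, 2 early layers (all values made up).
early_attn_maps = [
    [[0.8, 0.2, 0.0], [0.2, 0.8, 0.0], [0.0, 0.0, 1.0]],
    [[0.6, 0.4, 0.0], [0.4, 0.6, 0.0], [0.0, 0.2, 0.8]],
]
patch_values = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
sd_attn = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]

coarse, refined, labels = segment(early_attn_maps, patch_values, text_embeds, sd_attn)
print(labels)
```

Because every row of the toy diffusion attention map sums to one, each refined row remains a probability distribution over classes. A real pipeline would take the attention maps and embeddings from the CLIP image encoder and a diffusion model rather than from hand-written lists.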

Paper Structure

This paper contains 11 sections, 10 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparison with existing training-free CLIP-based open-vocabulary semantic segmentation approaches. In (a), several approaches [Wang_2024_SCLIP, Lan_2024_ClearCLIP, maskclip] replace the original self-attention map at the last layer with a self-self attention map, which better maintains spatial coherence. In (b), ProxyCLIP [Lan_2014_ProxyCLIP] takes a different strategy, replacing the original self-attention map with a vision foundation model-based (VFM-based) attention map. In (c), we use the embeddings and self-attention maps at early layers to fully exploit the spatial information within CLIP. We then perform fine-grained compensation using a diffusion model to further improve local details.
  • Figure 2: Visualization of patch embeddings in CLIP. In (a), we project the embeddings into a 3D space and observe that the early-layer embeddings exhibit good spatial coherence. In (b), we compute the cosine similarity between the early-layer embedding at a specific point and the embeddings at the last layer, revealing that the early-layer and last-layer embeddings share a similar embedding space.
  • Figure 3: Comparison of the self-attention maps of CLIP and Stable Diffusion (SD). We show the self-attention maps at selected points for both CLIP and SD. Compared with those of CLIP, the self-attention maps of SD focus more on capturing local details.
  • Figure 4: Overall architecture of our proposed method, CLIPer. CLIPer contains two components: early-layer fusion and fine-grained compensation. In early-layer fusion, we aggregate early-layer information from the CLIP image encoder, including embeddings and attention maps, to improve the spatial coherence of the output embeddings, which are combined with text embeddings to generate a coarse segmentation map. Fine-grained compensation uses the self-attention maps of Stable Diffusion to refine the local details of the coarse segmentation map.
  • Figure 5: Qualitative comparison with existing methods. We show segmentation results on three different datasets. Compared with these methods, our method produces more accurate segmentation results that are closer to the ground truth.
  • ...and 1 more figure
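The comparison described in the Figure 2(b) caption can be sketched as follows. This is only a minimal illustration, assuming each layer's output is available as a list of per-patch vectors. The helper names and toy embeddings are hypothetical and not taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_map(early_layer, last_layer, query_index):
    """Compare one early-layer patch embedding with every last-layer patch embedding."""
    query = early_layer[query_index]
    return [cosine_similarity(query, patch) for patch in last_layer]

# Toy embeddings for 3 patches (made-up values, 2 dimensions each).
early_layer = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
last_layer = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

sims = similarity_map(early_layer, last_layer, query_index=0)
print(sims)
```

High similarity at patches that belong to the same region as the query point would be consistent with the caption's claim that the early-layer and last-layer embeddings share a similar embedding space.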