Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

Yunheng Li, ZhongYu Li, Quansheng Zeng, Qibin Hou, Ming-Ming Cheng

TL;DR

This work introduces a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP that achieves superior zero-shot performance on segmentation benchmarks.

Abstract

Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual features weakens the zero-shot ability for novel classes. The large differences between the visual features from different layers make these features hard to align well with the text embeddings. We resolve this problem by introducing a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is flexible and can be easily applied to existing zero-shot semantic segmentation methods. Experimental results show that our simple Cascade-CLIP achieves superior zero-shot performance on segmentation benchmarks, like COCO-Stuff, Pascal-VOC, and Pascal-Context. Our code is available at: https://github.com/HVision-NKU/Cascade-CLIP
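
The cascaded alignment described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (the repository linked above holds that). It assumes, purely for illustration, that each "decoder" is a plain cosine-similarity scorer between patch features and that stage's own text embeddings, and that the per-stage score maps are combined by summation. The function names (`stage_logits`, `cascade_masks`) and all numbers are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def stage_logits(patch_feats, text_embs):
    """Score every patch against every class embedding for ONE stage.

    patch_feats: list of N patch vectors; text_embs: list of C class vectors.
    Returns an N x C list of scores (a stand-in for a learned decoder).
    """
    return [[cosine(p, t) for t in text_embs] for p in patch_feats]

def cascade_masks(stage_feats, stage_text_embs):
    """Sum the per-stage scores, then pick the best class for each patch.

    Assumption: every stage gets its own features AND its own text
    embeddings, mimicking independent decoders; real decoders are learned.
    """
    per_stage = [stage_logits(f, t) for f, t in zip(stage_feats, stage_text_embs)]
    n, c = len(per_stage[0]), len(per_stage[0][0])
    summed = [[sum(s[i][k] for s in per_stage) for k in range(c)] for i in range(n)]
    labels = [max(range(c), key=lambda k: row[k]) for row in summed]
    return summed, labels

# Toy data: 2 stages, 3 patches, 2 classes, 2-d features (all values made up).
stage_feats = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # middle-stage patch features
    [[2.0, 0.1], [0.1, 2.0], [0.2, 1.0]],  # last-stage patch features
]
stage_text_embs = [
    [[1.0, 0.0], [0.0, 1.0]],  # text embeddings used by the first decoder
    [[1.0, 0.0], [0.0, 1.0]],  # separate copy used by the second decoder
]
summed, labels = cascade_masks(stage_feats, stage_text_embs)
print(labels)  # → [0, 1, 1]
```

In this toy case the third patch is ambiguous for the first stage (equal scores for both classes), and the second stage breaks the tie, which is the intuition behind combining predictions from several levels rather than relying on one.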

Paper Structure (19 sections, 5 equations, 8 figures, 18 tables)

Figures (8)

  • Figure 1: Motivation illustration of Cascade-CLIP. The cosine similarity map (above) indicates that the visual features from the intermediate layers of CLIP [pmlr-v139-radford21a] capture richer local object details than those from the last layer (Layer 12).
  • Figure 2: Three zero-shot segmentation approaches based on CLIP. (a) ZegCLIP relies on the last-layer visual features without considering information from intermediate layers. (b) Inspired by SegFormer [xie2021segformer], we fuse intermediate- and last-layer features to enhance the feature representation, yet this integration disrupts the correlation between text and visual features. (c) To alleviate this issue, our Cascade-CLIP separates the image encoder into stages, aligns the deep and middle features with text embeddings through independent text-image decoders, and finally cascades the segmentation results.
  • Figure 3: Architecture of our Cascade-CLIP. The CLIP visual encoder is divided into multiple stages. We then employ the NGA module to aggregate the features of the blocks within each stage and assign an independent text-image decoder to the aggregated visual features and non-shared text embeddings. In the text-image decoder (right part of the figure), the segmentation mask is computed by scaled dot-product attention via the Multihead Attention (Attn) layers, inspired by [zhang2022segvit]. Finally, we combine the multi-level semantic masks produced by the different cascaded decoders to enhance segmentation predictions. (Please refer to the section on the NGA module for details.)
  • Figure 4: Centered kernel alignment (CKA) [kornblith2019similarity] heatmap between layers of (a) the original CLIP and (b) Cascade-CLIP (ours). The last row (red box) shows the similarity between features from the last layer and those from the other layers. The green box illustrates the similarity between adjacent layers.
  • Figure 5: Qualitative transductive results on COCO-Stuff 164K. The black and red tags represent seen and unseen classes, respectively.
  • ...and 3 more figures
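
Figure 4 compares layers using centered kernel alignment (CKA). For readers unfamiliar with the metric, the following is a minimal sketch of the linear variant of CKA in plain Python. Whether the paper uses the linear or a kernel variant is not stated in this excerpt, and the matrices below are made-up toy values rather than CLIP activations; the helper names are hypothetical.

```python
import math

def center_columns(m):
    """Subtract the column mean from every entry of a row-major matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data.

    x and y hold features of the SAME n examples (rows) from two layers;
    their feature dimensions may differ.
    """
    xc, yc = center_columns(x), center_columns(y)
    num = cross_frobenius_sq(xc, yc)
    den = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return num / den if den else 0.0

# Toy "layer outputs" for 4 examples (values are illustrative only).
layer_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [0.0, 1.0]]
layer_b = [[2.0 * v for v in reversed(row)] for row in layer_a]  # scaled, columns swapped
layer_c = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 3.0, 0.0]]

layers = [layer_a, layer_b, layer_c]
# Pairwise similarity matrix, i.e. the numbers behind a CKA heatmap.
heatmap = [[linear_cka(p, q) for q in layers] for p in layers]
for row in heatmap:
    print([round(v, 3) for v in row])
```

Because linear CKA is invariant to isotropic scaling and to orthogonal transformations of the features, `layer_a` and `layer_b` score 1 (up to floating-point error) even though their raw values differ, while `layer_c` scores lower against both.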