Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation

Bingfeng Zhang, Siyue Yu, Yunchao Wei, Yao Zhao, Jimin Xiao

TL;DR

WeCLIP uses a frozen CLIP model as the backbone for semantic feature extraction, a new decoder to interpret the extracted features for the final prediction, and a refinement module (RFM) to dynamically rectify the pseudo labels generated by the frozen backbone.

Abstract

Weakly supervised semantic segmentation has witnessed great achievements with image-level labels. Several recent approaches use the CLIP model to generate pseudo labels for training an individual segmentation model, while there is no attempt to apply the CLIP model as the backbone to directly segment objects with image-level labels. In this paper, we propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation. Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction, and a new decoder is designed to interpret extracted semantic features for final prediction. Meanwhile, we utilize the above frozen backbone to generate pseudo labels for training the decoder. Such labels cannot be optimized during training. We then propose a refinement module (RFM) to rectify them dynamically. Our architecture enforces the proposed decoder and RFM to benefit from each other to boost the final performance. Extensive experiments show that our approach significantly outperforms other approaches with less training cost. Additionally, our WeCLIP also obtains promising results for fully supervised settings. The code is available at https://github.com/zbf1991/WeCLIP.

Paper Structure (20 sections, 14 equations, 8 figures, 12 tables)

Figures (8)

  • Figure 1: Comparison between our approach and other single-stage or CLIP-based approaches. (a) Previous single-stage approach, which uses a trainable ImageNet [deng2009imagenet] pre-trained backbone with trainable classification and segmentation processes. (b) Previous CLIP-based approach, a multi-stage pipeline that uses the frozen CLIP model to produce pseudo labels and then trains a separate ImageNet pre-trained segmentation model. (c) Our approach, a single-stage pipeline that uses a frozen CLIP model as the backbone with a trainable segmentation process, significantly reducing the training cost.
  • Figure 2: Framework of our WeCLIP. The image is input to the frozen CLIP image encoder to generate image features; the class labels are used to build text prompts, which are input to the frozen CLIP text encoder to generate text features. Classification scores are computed from the distance between the pooled image features and the text features. Using GradCAM, we generate the initial CAM $M_{\text{init}}$. The frozen image features from the last layer of each transformer block are then input to our decoder to generate the final semantic segmentation predictions. Meanwhile, the affinity map $A_f$ from our decoder and the multi-head attention maps $A_s$ from CLIP are input to our RFM to build refining maps $R$, which refine $M_{\text{init}}$ into $M_f$. After post-processing, $M_f$ is used as supervision to train our decoder.
  • Figure 3: Qualitative comparison of CAMs. (a) Initial CAM. (b) CAM refined by the attention maps proposed in [lin2023clip]. (c) Our refined CAM. Our method produces more accurate responses.
  • Figure 4: Qualitative comparisons between our approach and ToCo [ru2023token] on the PASCAL VOC 2012 and MS COCO-2014 val sets. Our approach generates more detailed visual results.
  • Figure 5: Feature visualization with t-SNE [van2008visualizing] to show why frozen CLIP can be used for semantic segmentation. Each color represents one specific category. (a) Frozen ImageNet pre-trained feature visualization of ViT-B. (b) Frozen CLIP pre-trained feature visualization of ViT-B. Without any retraining, features belonging to the same class are more compact for the frozen CLIP model than for the ImageNet pre-trained model in (a). Best viewed in color.
  • ...and 3 more figures
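
Two steps from the Figure 2 caption can be illustrated with a small, self-contained sketch: scoring classes from the pooled image and text features, and refining an initial CAM with a refining map. This is not the authors' implementation. It assumes cosine similarity followed by a temperature-scaled softmax for the scores, and a row-normalized matrix product as the way $R$ is applied to $M_{\text{init}}$. The function names, the temperature value, and the toy inputs are all hypothetical, and the GradCAM step and the construction of $R$ from $A_f$ and $A_s$ are omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classification_scores(image_feat, text_feats, temperature=0.01):
    """Softmax over cosine similarities between one pooled image feature
    and one text feature per class.

    The temperature is an assumed value, not taken from the paper.
    """
    img = l2_normalize(image_feat)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_feats]
    logits = [s / temperature for s in sims]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refine_cam(refining_map, cam_init):
    """Assumed refinement rule: M_f = row_normalize(R) @ M_init.

    refining_map is an N x N list of non-negative pairwise weights between
    patches; cam_init is a flattened CAM with N entries.
    """
    refined = []
    for row in refining_map:
        row_sum = sum(row)
        weights = [w / row_sum for w in row]
        refined.append(sum(w * c for w, c in zip(weights, cam_init)))
    return refined


# Toy inputs (hypothetical): a 2-D feature space with two classes.
image_feat = [1.0, 0.0]
text_feats = [[1.0, 0.1], [0.0, 1.0]]
scores = classification_scores(image_feat, text_feats)

# Three patches; patches 0 and 1 are related, patch 2 is isolated.
cam_init = [1.0, 0.0, 0.0]
R = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
cam_refined = refine_cam(R, cam_init)
```

In the toy example, the activation on patch 0 spreads to the related patch 1 and leaves the unrelated patch 2 untouched. That is the general effect a refining map is meant to have on an incomplete initial CAM.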