Exploring CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation

Zhiwei Yang, Yucong Meng, Kexue Fu, Feilong Tang, Shuo Wang, Zhijian Song

TL;DR

This work targets weakly supervised semantic segmentation with image-level labels by unlocking CLIP's dense patch-text knowledge. It introduces ExCEL, a patch-text alignment framework built on Text Semantic Enrichment (TSE) and Visual Calibration (VC), where VC comprises Static Visual Calibration (SVC) and Learnable Visual Calibration (LVC). ExCEL constructs a dataset-wide attribute space from LLM-generated descriptions and calibrates CLIP's visual features first non-parametrically (SVC) and then parametrically (LVC), which improves CAM quality and segmentation at significantly reduced training cost. Experiments on PASCAL VOC and MS COCO demonstrate state-of-the-art results for a training-efficient, single-stage approach, highlighting the practical value of dense CLIP knowledge for WSSS.

Abstract

Weakly Supervised Semantic Segmentation (WSSS) with image-level labels aims to achieve pixel-level predictions using Class Activation Maps (CAMs). Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced in WSSS. However, recent methods primarily focus on image-text alignment for CAM generation, while CLIP's potential in patch-text alignment remains unexplored. In this work, we propose ExCEL to explore CLIP's dense knowledge via a novel patch-text alignment paradigm for WSSS. Specifically, we propose Text Semantic Enrichment (TSE) and Visual Calibration (VC) modules to improve the dense alignment across both text and vision modalities. To make text embeddings semantically informative, our TSE module applies Large Language Models (LLMs) to build a dataset-wide knowledge base and enriches the text representations with an implicit attribute-hunting process. To mine fine-grained knowledge from visual features, our VC module first proposes Static Visual Calibration (SVC) to propagate fine-grained knowledge in a non-parametric manner. Then Learnable Visual Calibration (LVC) is further proposed to dynamically shift the frozen features towards distributions with diverse semantics. With these enhancements, ExCEL not only retains CLIP's training-free advantages but also significantly outperforms other state-of-the-art methods with much less training cost on PASCAL VOC and MS COCO.
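
To make the patch-text alignment idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes patch features and class text embeddings have already been extracted from a CLIP-like model and live in the same embedding space. The function name `patch_text_cams`, the `temperature` value, the softmax over classes, and the per-class min-max normalization are illustrative assumptions that are not taken from the paper.

```python
import numpy as np

def patch_text_cams(patch_feats, text_embeds, grid_hw, temperature=0.01):
    """Hypothetical helper: turn patch-text similarities into per-class maps.

    patch_feats: (N, D) array of patch features, N = H * W.
    text_embeds: (C, D) array of class text embeddings.
    grid_hw:     (H, W) patch grid size.
    Returns a (C, H, W) array with values in [0, 1].
    """
    h, w = grid_hw
    if patch_feats.shape[0] != h * w:
        raise ValueError("number of patches must equal H * W")

    # Cosine similarity between every patch and every class embedding.
    p = patch_feats / (np.linalg.norm(patch_feats, axis=1, keepdims=True) + 1e-8)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=1, keepdims=True) + 1e-8)
    logits = (p @ t.T) / temperature  # (N, C)

    # Softmax over classes (an assumption; other normalizations are possible).
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    # Reshape to one spatial map per class and min-max normalize each map.
    cams = probs.T.reshape(-1, h, w)
    lo = cams.min(axis=(1, 2), keepdims=True)
    hi = cams.max(axis=(1, 2), keepdims=True)
    return (cams - lo) / (hi - lo + 1e-8)
```

In contrast to image-text alignment, which scores one global image embedding against each class, every patch here receives its own class score, so the output is already a dense map.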

Paper Structure

This paper contains 16 sections, 12 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Our motivation. (a) Previous methods leverage CLIP to generate CAMs with global image-text alignment, leaving CLIP's dense knowledge unexplored. (b) The proposed ExCEL explores CLIP's dense knowledge via a novel patch-text alignment paradigm, which generates better CAMs with less training cost.
  • Figure 2: ExCEL Architecture. We explore CLIP's dense knowledge with Text Semantic Enrichment (TSE) and Visual Calibration (VC). (a) TSE uses LLMs to build a knowledge base and clusters it into an implicit attribute space. The final text representation $T_c$ is enhanced by hunting for relevant attributes. For the vision modality, (b) we introduce Static Visual Calibration (SVC) to calibrate visual features using the Inter-correlation operation across $N$ intermediate layers. It generates static CAMs from $T_c$ and the calibrated features $P_s$. (c) Learnable Visual Calibration (LVC) uses a learnable adapter to add a dynamic shift $R$ to SVC. It generates optimized features $P_d$ under the guidance of static CAMs, and dynamic CAMs are then created from $P_d$ and $T_c$. Dynamic CAMs are refined for segmentation supervision. Details are in Section 3.1.
  • Figure 3: Segmentation visualizations of SeCo, WeCLIP [18], and ours on VOC and COCO. ExCEL segments objects more precisely.
  • Figure 4: CAM visualizations on VOC train set. (a) Image. (b-e) Ablative visualizations of proposed modules. (e-h) Qualitative comparisons of (e) ExCEL and recent CLIP-based methods, i.e., (f) WeCLIP [18], (g) CLIP-ES [16], and (h) MaskCLIP [27]. (i) Ground truth.
  • Figure 5: Implicit attribute responses. Based on the top-K similarity scores, 5 attributes are sampled for visualization.
  • ...and 1 more figure
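
The captions of Figure 2 (a) and Figure 5 describe enriching the text representation by hunting for relevant attributes using top-K similarity scores. The sketch below illustrates one way such a step could look. It is an assumption-laden illustration, not the paper's formulation: the function name `enrich_text_embedding`, the softmax weighting of the selected attributes, and the additive fusion with the class embedding are all hypothetical choices, and the attribute embeddings are assumed to be precomputed cluster centres of LLM-generated descriptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def enrich_text_embedding(class_embed, attribute_embeds, k=5):
    """Hypothetical attribute-hunting step.

    class_embed:      (D,) text embedding of one class name.
    attribute_embeds: (A, D) embeddings of the implicit attributes
                      (assumed to be cluster centres of description embeddings).
    Returns the enriched (D,) embedding and the indices of the selected attributes.
    """
    t = l2_normalize(class_embed)
    a = l2_normalize(attribute_embeds)

    # Cosine similarity between the class and every attribute.
    sims = a @ t  # (A,)

    # Keep the k most similar attributes.
    k = min(k, sims.shape[0])
    top_idx = np.argsort(-sims)[:k]

    # Weight the selected attributes by a softmax over their similarities
    # (an assumption made for this sketch).
    weights = np.exp(sims[top_idx] - sims[top_idx].max())
    weights = weights / weights.sum()
    attr_summary = weights @ a[top_idx]  # (D,)

    # Fuse the class embedding with its attribute summary (additive, assumed).
    return l2_normalize(t + attr_summary), top_idx
```

The enriched vector would then take the place of the plain class embedding when computing patch-text similarities.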