SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation

Xiuli Bi, Die Xiao, Junchao Fan, Bin Xiao

TL;DR

This work targets the modality gap and affinity noise in CLIP-based weakly supervised semantic segmentation. It introduces Cross-Modal Prototype Alignment (CMPA), which uses prototype contrastive learning to align image and text representations, and Superpixel-Guided Correction (SGC), which constrains spatial propagation using superpixel priors. Together, CMPA and SGC improve both semantic accuracy and boundary localization, yielding state-of-the-art mIoU on PASCAL VOC and MS COCO and narrowing the gap to fully supervised results. The methods are validated through ablations and visualizations, which show consistent improvements over prior CLIP-based WSSS approaches.

Abstract

In recent years, Contrastive Language-Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) tasks due to its powerful cross-modal semantic understanding capabilities. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches: over-activation in non-target foreground regions and background areas. Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment across modalities, reducing inter-class overlap while enhancing semantic correlations, to rectify over-activation in non-target foreground regions effectively; at the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely filter out interference from non-target regions during affinity propagation, significantly rectifying background over-activation. Extensive experiments on the PASCAL VOC and MS COCO datasets demonstrate that our method outperforms all single-stage approaches, as well as more complex multi-stage approaches, achieving mIoU scores of 79.5% and 50.6%, respectively.
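The semantic-level idea in the abstract, pulling each class's visual representation toward its matching text representation while pushing it away from the others, can be illustrated with a generic InfoNCE-style loss over class prototypes. This is only a minimal sketch under stated assumptions: the function name `prototype_contrastive_loss`, the one-prototype-per-class layout, and the temperature value are all hypothetical and are not taken from the paper, whose exact CMPA formulation may differ.

```python
import numpy as np

def prototype_contrastive_loss(visual_protos, text_protos, temperature=0.07):
    """Generic InfoNCE-style loss between per-class prototypes (illustrative only).

    visual_protos, text_protos: arrays of shape (K, D), row k belongs to class k.
    The temperature default is an assumption, not a value from the paper.
    """
    # L2-normalize rows so that dot products are cosine similarities.
    v = visual_protos / np.linalg.norm(visual_protos, axis=1, keepdims=True)
    t = text_protos / np.linalg.norm(text_protos, axis=1, keepdims=True)

    # (K, K) similarity logits: entry (i, j) compares visual class i to text class j.
    logits = (v @ t.T) / temperature

    # Numerically stable log-softmax over text prototypes for each visual prototype.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive pair for visual class k is text class k (the diagonal).
    return float(-np.mean(np.diag(log_probs)))
```

With this kind of objective, the loss is low when each visual prototype is closest to its own class's text prototype and high when it is closer to another class's, which is the sense in which contrastive alignment can reduce inter-class overlap.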

Paper Structure

This paper contains 29 sections, 12 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Our motivation. (a) Previous methods exhibit inherent limitations. (b) The initial CAM stage suffers from a modality gap, where the visual feature space faces dual challenges of intra-class dispersion and inter-class overlap. (c) The refined CAM stage is plagued by spurious background responses, as affinity estimation is corrupted by background noise. (d) To address these issues, we propose SSR.
  • Figure 2: Overview of our SSR. We propose two novel components to address the key challenges of modality gap and erroneous activation: the CMPA and SGC. (a) The CMPA utilizes cross-modal prototype contrastive learning to establish precise matching relationships between visual features and textual prototypes in a shared embedding space, thereby effectively alleviating class confusion. (b) The SGC utilizes local spatial consistency priors derived from superpixel clustering to selectively filter the feature affinity matrix, eliminating erroneous cross-region propagation and guiding the feature refinement process toward semantically consistent directions, thereby significantly suppressing background over-activation.
  • Figure 3: Segmentation visualizations of SeCo, DUPL, WeCLIP, MoRe, and Ours on VOC and COCO. Columns 1-4: Results on the PASCAL VOC dataset. Columns 5-7: Results on the MS COCO dataset. SSR segments objects more precisely.
  • Figure 4: CAM visualizations on the VOC val set. We compare the initial CAMs generated by CLIP with those produced by our CMPA, and then compare WeCLIP's results with our final optimized outputs.
  • Figure 5: Visualization of CAM refinement: (c) Initial CAM with background artifacts; (d) Refined CAM after SGC processing, showing cleaner background suppression and sharper target focus.
  • ...and 1 more figure
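The spatial-level idea described in the Figure 2(b) caption, filtering the affinity matrix with superpixel priors so that activations do not propagate across regions, can be sketched as a simple masking step. The sketch below assumes one particular filtering rule (keep an affinity only when both pixels share a superpixel, then renormalize each row); this rule, the function name `superpixel_masked_propagation`, and the flattened input layout are illustrative assumptions and not the paper's exact SGC procedure. The superpixel labels are taken as given; in practice they might come from an algorithm such as SLIC.

```python
import numpy as np

def superpixel_masked_propagation(cam, affinity, sp_labels):
    """Propagate CAM scores through an affinity matrix restricted to superpixels.

    cam:       (N,) activation scores for N flattened pixels or patches.
    affinity:  (N, N) non-negative pairwise affinities.
    sp_labels: (N,) integer superpixel id for each pixel or patch.
    The same-superpixel rule is an assumed stand-in for the paper's filtering.
    """
    # True where pixel i and pixel j fall in the same superpixel.
    same_region = sp_labels[:, None] == sp_labels[None, :]

    # Zero out affinities that cross superpixel boundaries.
    masked = np.where(same_region, affinity, 0.0)

    # Row-normalize so each refined score is a weighted average of its region.
    row_sum = masked.sum(axis=1, keepdims=True)
    masked = masked / np.maximum(row_sum, 1e-12)

    return masked @ cam
```

For example, with a uniform affinity matrix, unmasked propagation would average foreground and background scores together, whereas the masked version keeps each superpixel's scores separate, which is the behavior that suppresses background over-activation in this toy setting.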