SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation
Xiuli Bi, Die Xiao, Junchao Fan, Bin Xiao
TL;DR
This work targets the modality gap and affinity noise in CLIP-based weakly supervised semantic segmentation (WSSS). It introduces Cross-Modal Prototype Alignment (CMPA), which uses prototype contrastive learning to tightly align image and text representations, and Superpixel-Guided Correction (SGC), which constrains spatial propagation using superpixel priors. Together, CMPA and SGC improve both semantic accuracy and boundary localization, yielding state-of-the-art mIoU on PASCAL VOC (79.5%) and MS COCO (50.6%) and narrowing the gap to fully supervised results. Extensive ablations and visualizations validate the approach, showing robust improvements over prior CLIP-based WSSS approaches.
Abstract
In recent years, Contrastive Language-Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) because of its powerful cross-modal semantic understanding. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address two limitations of existing CLIP-based WSSS approaches: over-activation in non-target foreground regions and over-activation in background areas. At the semantic level, Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism that enforces feature space alignment across modalities, reducing inter-class overlap while enhancing semantic correlations, and thereby rectifies over-activation in non-target foreground regions. At the spatial level, Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to filter out interference from non-target regions during affinity propagation, significantly reducing background over-activation. Extensive experiments on the PASCAL VOC and MS COCO datasets demonstrate that our method outperforms all single-stage approaches, as well as more complex multi-stage approaches, achieving mIoU scores of 79.5% and 50.6%, respectively.
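The two mechanisms named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the InfoNCE-style loss, the temperature value, and the hard same-superpixel mask are all assumptions chosen only to make the ideas concrete. The first function pulls each class's image prototype toward its own text prototype and away from the others; the second keeps affinity only between pixels that share a superpixel before propagating activations.

```python
import numpy as np

def prototype_contrastive_loss(image_protos, text_protos, temperature=0.07):
    """InfoNCE-style loss (assumed form): row i of image_protos should
    match row i of text_protos and differ from every other row."""
    img = image_protos / np.linalg.norm(image_protos, axis=1, keepdims=True)
    txt = text_protos / np.linalg.norm(text_protos, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                   # (C, C) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def superpixel_masked_affinity(affinity, superpixel_labels):
    """Zero out affinity between pixels in different superpixels
    (assumed hard mask), then row-normalize. Expects non-negative affinity."""
    labels = superpixel_labels.reshape(-1)               # (N,)
    same_region = labels[:, None] == labels[None, :]     # (N, N) boolean mask
    masked = np.where(same_region, affinity, 0.0)
    row_sum = masked.sum(axis=1, keepdims=True)
    return masked / np.clip(row_sum, 1e-8, None)

def propagate(cam, affinity):
    """Spread flattened activations cam (N, C) along the affinity matrix (N, N)."""
    return affinity @ cam

# Toy usage: 3 classes with 3-dimensional prototypes.
aligned = prototype_contrastive_loss(np.eye(3), np.eye(3))
shuffled = prototype_contrastive_loss(np.eye(3), np.roll(np.eye(3), 1, axis=0))

# Toy usage: a 2x2 image whose top row and bottom row are separate superpixels.
labels = np.array([[0, 0], [1, 1]])
affinity = superpixel_masked_affinity(np.ones((4, 4)), labels)
cam = np.array([[1.0], [0.0], [0.0], [0.0]])             # activation on pixel 0 only
refined = propagate(cam, affinity)
```

In the toy example, the activation on pixel 0 spreads to its superpixel neighbor but not across the superpixel boundary, which is the kind of constraint on affinity propagation the abstract attributes to SGC.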
