Zero-Shot Pseudo Labels Generation Using SAM and CLIP for Semi-Supervised Semantic Segmentation

Nagito Saito; Shintaro Ito; Koichi Ito; Takafumi Aoki

TL;DR

The paper tackles the high cost of pixel-wise annotation in semantic segmentation with a zero-shot pseudo-labeling pipeline that uses the Segment Anything Model (SAM) for segmentation and Contrastive Language-Image Pretraining (CLIP) for labeling. The resulting pseudo labels are refined with a UniMatch-inspired perturbation framework to produce enhanced labels, and a segmentation model is then trained with the balanced loss L = 1/2 (L_s + L_u), which weights a supervised term L_s and an unsupervised term L_u equally; L_u aggregates label-smoothed cross-entropy over multiple perturbed outputs. On PASCAL VOC 2012 and COCO, the method reports mIoU that matches or exceeds recent semi-supervised baselines across label-split settings, while using smaller image sizes and flexible backbones. By enabling effective semi-supervised segmentation from zero-shot pseudo labels, the approach aims to reduce the annotation burden in domains such as medical imaging and autonomous driving.
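The balanced loss described above can be illustrated with a minimal, per-pixel sketch in plain Python. The function names, the smoothing factor `eps`, and the choice to average (rather than sum) over the perturbed outputs are assumptions made for illustration; the summary only states that L_u aggregates label-smoothed cross-entropy over multiple perturbed outputs.

```python
import math


def label_smoothed_ce(logits, target, eps=0.1):
    """Label-smoothed cross-entropy for a single pixel.

    logits: list of raw class scores; target: index of the (pseudo) label.
    eps is a hypothetical smoothing factor, not a value from the paper.
    """
    # Numerically stable log-softmax
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    k = len(logits)
    # Smoothed target: (1 - eps) on the labeled class plus eps / k on every class
    return -sum(
        ((1.0 - eps) * (1.0 if i == target else 0.0) + eps / k) * lp
        for i, lp in enumerate(log_probs)
    )


def total_loss(sup_loss, perturbed_logits, enhanced_label, eps=0.1):
    """L = 1/2 (L_s + L_u) for one pixel.

    sup_loss: supervised term L_s, computed elsewhere on labeled data.
    perturbed_logits: one logit list per perturbed forward pass.
    Averaging over the perturbed outputs is an assumption of this sketch.
    """
    l_u = sum(
        label_smoothed_ce(lg, enhanced_label, eps) for lg in perturbed_logits
    ) / len(perturbed_logits)
    return 0.5 * (sup_loss + l_u)
```

In practice the same computation would be carried out over whole batches of images in a deep learning framework; the scalar version here is only meant to make the structure of the two terms explicit.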

Abstract

Semantic segmentation is a fundamental task in medical image analysis and autonomous driving, but the annotated labels required for training are costly to produce. To address this problem, semi-supervised semantic segmentation methods that learn from a small amount of labeled data have been proposed. One such approach trains a semantic segmentation model using both images with annotated labels and images with pseudo labels. In this approach, the accuracy of the model depends on the quality of the pseudo labels, which in turn depends on the performance of the model being trained and on the amount of annotated data. In this paper, we generate pseudo labels by zero-shot annotation with the Segment Anything Model (SAM) and Contrastive Language-Image Pretraining (CLIP), improve the accuracy of the pseudo labels using the Unified Dual-Stream Perturbations Approach (UniMatch), and use them as enhanced labels to train a semantic segmentation model. The effectiveness of the proposed method is demonstrated through experiments on two public datasets, PASCAL and MS COCO. The project web page is available at: https://gsisaoki.github.io/ZERO-SHOT-PLG/
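As a rough illustration of the zero-shot annotation step, the sketch below combines class-agnostic region masks (such as SAM could produce) with per-region class scores (such as CLIP could produce) into a single pseudo-label map. It does not call SAM or CLIP; the input formats, the function name `paint_pseudo_labels`, the confidence threshold `min_score`, and the largest-first painting order are all hypothetical choices for this example, not the authors' procedure. The value 255 follows the common PASCAL VOC convention for ignored pixels.

```python
def paint_pseudo_labels(height, width, masks, class_scores,
                        ignore_index=255, min_score=0.5):
    """Build a pseudo-label map from region masks and per-region class scores.

    masks: list of sets of (row, col) pixels, one set per region
           (a stand-in for SAM output; the format is an assumption).
    class_scores: list of per-class probability lists, one per region
           (a stand-in for CLIP output; the format is an assumption).
    """
    # Start with every pixel unlabeled
    label_map = [[ignore_index] * width for _ in range(height)]
    # Paint larger regions first so that smaller regions overwrite them
    order = sorted(range(len(masks)), key=lambda i: len(masks[i]), reverse=True)
    for i in order:
        scores = class_scores[i]
        best = max(range(len(scores)), key=scores.__getitem__)
        # Leave low-confidence regions unlabeled (hypothetical threshold)
        if scores[best] < min_score:
            continue
        for r, c in masks[i]:
            label_map[r][c] = best
    return label_map
```

Pixels left at `ignore_index` would simply be excluded from the loss, so uncertain regions do not inject noisy supervision.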
