Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models

Zhaozheng Chen, Qianru Sun

TL;DR

Weakly supervised semantic segmentation (WSSS) from image-level labels remains challenging because class activation maps (CAMs) typically cover only part of each object's extent. This paper surveys traditional WSSS methods across four taxonomy branches and evaluates the applicability of vision foundation models, notably SAM and CLIP, in both text-prompted and zero-shot settings. Empirical results show that SAM-based approaches deliver superior pseudo-masks and segmentation quality, frequently rivaling fully supervised methods on VOC and approaching them on COCO; zero-shot SAM, which relies on grounding and tagging models, sometimes surpasses supervised baselines. The findings highlight foundation models as a highly promising direction for WSSS. They also identify bottlenecks in grounding accuracy, prompting strategies, and computational efficiency, and point to future work on better integration, domain generalization, and efficiency.

Abstract

The rapid development of deep learning has driven significant progress in image semantic segmentation, a fundamental task in computer vision. Semantic segmentation algorithms often depend on the availability of pixel-level labels (i.e., masks of objects), which are expensive, time-consuming, and labor-intensive to obtain. Weakly-supervised semantic segmentation (WSSS) is an effective solution to avoid such labeling. It utilizes only partial or incomplete annotations and provides a cost-effective alternative to fully-supervised semantic segmentation. In this paper, our focus is on WSSS with image-level labels, which is the most challenging form of WSSS. Our work has two parts. First, we conduct a comprehensive survey of traditional methods, primarily focusing on those presented at premier research conferences. We categorize them into four groups based on where their methods operate: pixel-wise, image-wise, cross-image, and external data. Second, we investigate the applicability of visual foundation models, such as the Segment Anything Model (SAM), in the context of WSSS. We scrutinize SAM in two intriguing scenarios: text prompting and zero-shot learning. We provide insights into the potential and challenges of deploying visual foundation models for WSSS, facilitating future developments in this exciting research area.

Paper Structure (39 sections, 2 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: The performance of recent WSSS works (CONTA, IRN, ReCAM, AMN, LPCAM, and CLIP-ES) and an evaluation of foundation models on the MS COCO val set.
  • Figure 2: The pipeline of existing methods in WSSS. Based on the number of stages, they can be divided into (a) single-stage and (b) two-stage methods. Based on the models used, they can be divided into traditional methods and foundation models.
  • Figure 3: The pipeline of applying SAM in WSSS. All models except the fully-supervised segmentation model are kept frozen.
  • Figure 4: Visualization of pseudo masks generated by LPCAM, CLIP-ES, SAM (text input), and SAM (zero-shot) on the VOC dataset. (a) Examples showcasing high-quality masks produced by both SAM (text input) and SAM (zero-shot). (b) Examples where SAM produced masks that even surpass the quality of the ground truth masks. (c) Examples illustrating the failure cases of SAM (text input) and SAM (zero-shot).
  • Figure 5: Visualization of pseudo masks generated by LPCAM, CLIP-ES, SAM (text input), and SAM (zero-shot) on the MS COCO dataset. (a) Examples showcasing high-quality masks produced by SAM in complex scenes. (b) Examples illustrating the failure cases of SAM.
  • ...and 1 more figure