Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models
Zhaozheng Chen, Qianru Sun
TL;DR
Weakly supervised semantic segmentation from image-level labels remains challenging due to incomplete object extents in CAMs. This paper surveys traditional WSSS methods across four taxonomy branches and evaluates the applicability of vision foundation models, notably SAM and CLIP, in both text-prompted and zero-shot settings. Empirical results show SAM-based approaches deliver superior pseudo-masks and segmentation quality, frequently rivaling fully supervised methods on VOC and approaching them on COCO, with zero-shot SAM sometimes surpassing supervised baselines by leveraging grounding and tagging models. The findings highlight foundation models as a highly promising direction for WSSS, while also identifying bottlenecks such as grounding accuracy, prompting strategies, and computational efficiency, and pointing to future work in better integration, domain generalization, and efficiency improvements.
Abstract
The rapid development of deep learning has driven significant progress in image semantic segmentation, a fundamental task in computer vision. Semantic segmentation algorithms often depend on pixel-level labels (i.e., object masks), which are expensive, time-consuming, and labor-intensive to annotate. Weakly-supervised semantic segmentation (WSSS) is an effective way to avoid such labeling: it uses only partial or incomplete annotations and provides a cost-effective alternative to fully-supervised semantic segmentation. In this article, we focus on WSSS with image-level labels, the most challenging form of WSSS. Our work has two parts. First, we conduct a comprehensive survey of traditional methods, primarily those presented at premier research conferences. We categorize them into four groups based on the level at which they operate: pixel-wise, image-wise, cross-image, and external data. Second, we investigate the applicability of visual foundation models, such as the Segment Anything Model (SAM), in the context of WSSS. We scrutinize SAM in two intriguing scenarios: text prompting and zero-shot learning. We provide insights into the potential and challenges of deploying visual foundation models for WSSS, facilitating future developments in this exciting research area.
