Rethinking Saliency-Guided Weakly-Supervised Semantic Segmentation

Beomyoung Kim, Donghyun Kim, Sung Ju Hwang

TL;DR

This work reframes the role of saliency maps in image-level weakly-supervised semantic segmentation by showing that saliency map quality and the threshold used to convert activation maps into pseudo labels are critical yet underexplored. It demonstrates consistent, large performance variations across methods when different saliency maps are used, arguing that lack of standardization hampers fair comparisons. To address this, the authors introduce WSSS-BED, a unified framework that provides diverse saliency/activation maps and even unsupervised SOD outputs to enable controlled, reproducible experiments across seven WSSS methods. Empirically, high-quality saliency maps (e.g., from large SOD datasets like DUTS or COCO-derived masks) can boost WSSS performance toward or beyond state-of-the-art, while CAM can still be highly competitive with proper $\tau$ tuning, underscoring the importance of threshold design and saliency integration in practice.
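As a rough illustration of how a threshold $\tau$ and a saliency map can jointly turn activation maps into a pseudo label, consider the minimal sketch below. The labelling rule is an assumption made for illustration, not the procedure of any specific method in the paper: non-salient pixels become background, salient pixels whose best class score reaches $\tau$ take that class, and the remaining salient pixels are marked as ignored. The names `make_pseudo_label`, `sal_thresh`, and `IGNORE` are hypothetical.

```python
import numpy as np

IGNORE = 255  # hypothetical index for pixels left unlabelled


def make_pseudo_label(activation, saliency, tau, sal_thresh=0.5):
    """Combine class activation maps with a saliency map (illustrative rule).

    activation: float array (C, H, W), one map per foreground class, in [0, 1]
    saliency:   float array (H, W) in [0, 1]
    tau:        activation threshold
    sal_thresh: assumed cut-off separating salient from non-salient pixels
    """
    best_class = activation.argmax(axis=0)  # most activated class per pixel
    best_score = activation.max(axis=0)     # its activation value

    label = np.full(saliency.shape, IGNORE, dtype=np.int64)
    salient = saliency >= sal_thresh

    # Non-salient pixels are treated as background (class 0).
    label[~salient] = 0

    # Salient pixels with a confident activation take that class (1..C).
    foreground = salient & (best_score >= tau)
    label[foreground] = best_class[foreground] + 1

    # Salient but low-activation pixels keep the IGNORE index.
    return label


if __name__ == "__main__":
    act = np.array([[[0.9, 0.2], [0.1, 0.3]],
                    [[0.1, 0.6], [0.2, 0.1]]])
    sal = np.array([[1.0, 1.0], [0.0, 1.0]])
    print(make_pseudo_label(act, sal, tau=0.5))
    print(make_pseudo_label(act, sal, tau=0.25))
```

In this toy input, lowering `tau` from 0.5 to 0.25 turns one ignored pixel into a labelled one. Under this assumed rule, that is how a sparse activation map can still yield a dense pseudo label.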

Abstract

This paper presents a fresh perspective on the role of saliency maps in weakly-supervised semantic segmentation (WSSS) and offers new insights and research directions based on our empirical findings. We conduct comprehensive experiments and observe that the quality of the saliency map is a critical factor in saliency-guided WSSS approaches. Nonetheless, we find that the saliency maps used in previous works are often arbitrarily chosen, despite their significant impact on WSSS. Additionally, we observe that the choice of the threshold, which has received less attention before, is non-trivial in WSSS. To facilitate more meaningful and rigorous research for saliency-guided WSSS, we introduce WSSS-BED, a standardized framework for conducting research under unified conditions. WSSS-BED provides various saliency maps and activation maps for seven WSSS methods, as well as saliency maps from unsupervised salient object detection models.

Paper Structure (42 sections, 2 equations, 7 figures, 7 tables, 1 algorithm)

This paper contains 42 sections, 2 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Performance variation according to the saliency map. The saliency maps used by different methods are not unified, and the choice of saliency map has a substantial impact on each method's performance. 'Sal METHOD' on the x-axis denotes the saliency map used in METHOD. Scores are measured on the VOC 2012 validation set.
  • Figure 2: Qualitative samples of pseudo segmentation labels (even rows) given saliency maps (odd rows) and activation maps. 'Sal METHOD (SOD-MODEL)' indicates the saliency map used in WSSS METHOD, generated with the pre-trained SOD-MODEL. The saliency map is crucial in determining the quality of the pseudo label. Although some methods (e.g., DRS [kim2021discriminative], EDAM [wu2021embedded], L2G [jiang2022l2g]) employ the same SOD-MODEL (i.e., PoolNet [liu2019simple]), the quality of their saliency maps differs greatly.
  • Figure 3: Qualitative comparisons of activation maps (odd rows) and pseudo labels (even rows) of WSSS methods using the same saliency map. The activation map from CAM [zhou2016learning] highlights the object region more sparsely than those of other methods, but with a lower threshold the pseudo labels from CAM are highly competitive. Because activation-map quality varies widely across methods, the threshold must be set separately for each method.
  • Figure 4: Quantitative comparisons of WSSS methods according to the threshold $\tau$ given equivalent saliency maps. The threshold has a substantial impact on WSSS methods, and the optimal threshold varies across methods. Conventional CAM with a low threshold is highly competitive with state-of-the-art methods.
  • Figure 5: Qualitative samples of saliency maps and pseudo labels according to the SOD model and dataset. 'SOD-MODEL (SOD-DATASET)' denotes a saliency map generated by SOD-MODEL pre-trained on SOD-DATASET. With a larger-scale dataset (e.g., DUTS [wang2017learning]) or a more powerful model (e.g., VST [liu2021visual]), the quality of the generated saliency map improves, resulting in higher-quality pseudo labels.
  • ...and 2 more figures
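
The threshold comparison described in Figure 4 can be approximated by sweeping $\tau$ and scoring the resulting pseudo labels against ground truth. The sketch below is a simplified stand-in under stated assumptions, not the paper's evaluation code: it omits the saliency map (pixels below $\tau$ simply become background) and averages IoU over the classes present in a single image, whereas a full evaluation would accumulate statistics over the whole dataset. The names `mean_iou` and `sweep_threshold` are hypothetical.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))


def sweep_threshold(activation, target, taus):
    """Score pseudo labels for each tau and return (best_tau, scores).

    activation: float array (C, H, W) of foreground class scores in [0, 1]
    target:     int array (H, W), 0 = background, 1..C = foreground classes
    taus:       iterable of candidate thresholds
    """
    num_classes = activation.shape[0] + 1  # foreground classes + background
    best_class = activation.argmax(axis=0) + 1
    best_score = activation.max(axis=0)

    scores = {}
    for tau in taus:
        # Simplification: pixels below tau fall back to background.
        pred = np.where(best_score >= tau, best_class, 0)
        scores[tau] = mean_iou(pred, target, num_classes)

    best_tau = max(scores, key=scores.get)
    return best_tau, scores


if __name__ == "__main__":
    act = np.array([[[0.9, 0.4], [0.2, 0.1]]])
    gt = np.array([[1, 1], [0, 0]])
    print(sweep_threshold(act, gt, taus=[0.3, 0.5]))
```

Running the same sweep separately for each method's activation maps is one way to avoid comparing methods at a single shared threshold, which the caption of Figure 4 suggests can be misleading.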