Exploiting Inherent Class Label: Towards Robust Scribble Supervised Semantic Segmentation

Xinliang Zhang, Lei Zhu, Shuang Zeng, Hangzhou He, Ourui Fu, Zhengjian Yao, Zhaoheng Xie, Yanye Lu

TL;DR

This work tackles scribble-based weakly supervised semantic segmentation by introducing CSPNet (class-driven scribble promotion network), which leverages the class labels inherent in scribbles to generate robust pseudo-labels without over-relying on noisy scribble-driven predictions. Key innovations include the localization rectification module (LoRM), which rectifies foreground representations misled by noisy pseudo-labels, and the distance perception module (DPM), which identifies reliable regions surrounding scribble annotations and pseudo-labels. A dedicated scribble simulation algorithm and two large-scale benchmarks, ScribbleCOCO and ScribbleCityscapes, enable evaluation across diverse scribble styles. Empirical results show competitive accuracy and strong robustness to scribble variability; the authors state that the code and datasets will be made publicly available.

Abstract

Scribble-based weakly supervised semantic segmentation leverages only a few annotated pixels as labels to train a segmentation model, presenting significant potential for reducing the human labor involved in the annotation process. This approach faces two primary challenges: first, the sparsity of scribble annotations can lead to inconsistent predictions due to limited supervision; second, the variability in scribble annotations, reflecting differing human annotator preferences, can prevent the model from consistently capturing the discriminative regions of objects, potentially leading to unstable predictions. To address these issues, we propose a holistic framework, the class-driven scribble promotion network, for robust scribble-supervised semantic segmentation. This framework not only utilizes the provided scribble annotations but also leverages their associated class labels to generate reliable pseudo-labels. Within the network, we introduce a localization rectification module to mitigate noisy labels and a distance perception module to identify reliable regions surrounding scribble annotations and pseudo-labels. In addition, we introduce new large-scale benchmarks, ScribbleCOCO and ScribbleCityscapes, accompanied by a scribble simulation algorithm that enables evaluation across varying scribble styles. Our method demonstrates competitive performance in both accuracy and robustness, underscoring its superiority over existing approaches. The datasets and the codes will be made publicly available.
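
The abstract only says that the distance perception module identifies reliable regions surrounding scribble annotations and pseudo-labels; it does not give a formula. The following is a minimal sketch of one way such a distance-based reliability map could look, assuming a Gaussian decay with distance to the nearest scribble pixel. The function name `scribble_distance_weights`, the parameter `sigma`, and the Gaussian form are illustrative assumptions, not the paper's definition.

```python
import math


def scribble_distance_weights(scribble_mask, sigma=2.0):
    """Weights in (0, 1] that decay with distance to the nearest scribble pixel.

    scribble_mask: 2D list of 0/1 values, where 1 marks an annotated pixel.
    sigma: hypothetical bandwidth controlling how quickly reliability decays.
    """
    seeds = [
        (r, c)
        for r, row in enumerate(scribble_mask)
        for c, value in enumerate(row)
        if value
    ]
    if not seeds:
        raise ValueError("scribble_mask contains no annotated pixels")

    weights = []
    for r, row in enumerate(scribble_mask):
        out_row = []
        for c in range(len(row)):
            # Brute-force nearest-seed search; adequate for a small illustration.
            d = min(math.hypot(r - sr, c - sc) for sr, sc in seeds)
            # Assumed Gaussian decay: 1.0 on the scribble, smaller farther away.
            out_row.append(math.exp(-(d * d) / (2.0 * sigma * sigma)))
        weights.append(out_row)
    return weights


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
weights = scribble_distance_weights(mask, sigma=1.0)
```

Under this assumption, pixels on the scribble receive weight 1.0 and the weight falls off smoothly with distance, so a per-pixel loss could be down-weighted far from any annotation. On real images a distance transform from an image-processing library would replace the brute-force search.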

Paper Structure

This paper contains 32 sections, 21 equations, 25 figures, 12 tables, 1 algorithm.

Figures (25)

  • Figure 1: Scribble-supervised semantic segmentation presents two main challenges. While the issue of sparse supervision can be addressed by introducing pseudo-labels, the robustness challenge posed by the variability of scribble annotations remains to be investigated. To mitigate this issue, we propose to generate the pseudo-label from the class label inherent in the scribble, as presented in (d), which differs from previous methods (a)-(c). $P$ represents the model's prediction, $P'$ represents the pseudo-label, and the dotted line represents the supervision.
  • Figure 2: An overview of our method (CSPNet). In the first stage, taking ToCo as the base network, we train the model with both image-level class supervision and pixel-level scribble supervision. After training, we generate the pseudo-label from the classification branch, which is used to train the semantic segmentation model in the second stage. In the second stage, we train DeepLabV3+ with the basic supervision, the distance perception module, and the localization rectification module, using both the scribble and the pseudo-label as supervision.
  • Figure 3: Visualization results with a ResNet50 backbone and a DeepLabV2 segmentor. (a) is the original image with scribble label, (b) is the pseudo-label for training, (c) is the prediction trained with $\mathcal{L}_{basic}$, (d) is the prediction trained with $\mathcal{L}_{basic}+\mathcal{L}_{lorm}$, and (e) is the ground truth label.
  • Figure 4: The workflow of the localization rectification module. "Matmul" is short for matrix multiplication. The module takes the feature map $\mathbf{F}$ and the foreground pseudo mask $\mathbf{M}$ as inputs, and outputs the loss between the rectified feature map and the original feature map.
  • Figure 5: Feature map visualization of the last layer of the ResNet50 backbone with a DeepLabV2 segmentor. (a) is the pseudo-label and its corresponding foreground mask. (b) is the prediction without LoRM and its corresponding heat map. (c) is the prediction with LoRM and its corresponding heat map.
  • ...and 20 more figures
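
The Figure 4 caption gives the inputs and output of the localization rectification module (feature map $\mathbf{F}$, foreground pseudo mask $\mathbf{M}$, a matrix multiplication, and a loss between the rectified and original feature maps) but not the exact computation. The sketch below is one plausible reading, assuming an attention-style rectification in which every pixel feature is re-expressed as a softmax-weighted mixture of foreground features. The function `rectification_loss`, the scaling by the square root of the channel count, and the mean-squared-error form are assumptions made for illustration, not the paper's formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rectification_loss(features, fg_mask):
    """Mean squared difference between original and rectified features.

    features: list of N pixel feature vectors, each of length C (flattened F).
    fg_mask: list of N 0/1 flags taken from the foreground pseudo mask M.
    """
    foreground = [f for f, m in zip(features, fg_mask) if m]
    if not foreground:
        return 0.0  # nothing to rectify against

    channels = len(features[0])
    scale = math.sqrt(channels)
    total = 0.0
    for f in features:
        # First matmul: similarity of this pixel to every foreground pixel.
        attn = softmax([dot(f, g) / scale for g in foreground])
        # Second matmul: mix foreground features by the attention weights.
        rectified = [
            sum(a * g[k] for a, g in zip(attn, foreground))
            for k in range(channels)
        ]
        total += sum((x - y) ** 2 for x, y in zip(f, rectified))
    return total / (len(features) * channels)


# Two pixels with orthogonal features; only the first is marked foreground.
loss = rectification_loss([[1.0, 0.0], [0.0, 1.0]], [1, 0])
```

In this toy case the foreground pixel is reproduced exactly while the background pixel is pulled toward the single foreground feature, so only the background pixel contributes to the loss. A real implementation would operate on batched tensors in a deep learning framework rather than Python lists.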