Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation

Xinyu Yang, Hossein Rahmani, Sue Black, Bryan M. Williams

TL;DR

This paper tackles weakly supervised semantic segmentation (WSSS), where the CAMs used to produce pseudo-labels are unstable and often depend on offline refinement. It introduces Co-training with Swapping Assignments (CoSA), a fully end-to-end dual-stream framework in which an assignment network generates CAM pseudo-labels (CPL) and segmentation pseudo-labels (SPL) that are swapped to supervise the segmentation and CAM outputs of an online network. Key contributions include guiding CAMs with SPL, reliability-based adaptive weighting (RAW) of pseudo-labels, dynamic thresholding that adapts as training progresses, and a contrastive separation loss to mitigate CAM coexistence. The approach achieves state-of-the-art results on VOC and COCO, reduces or eliminates the need for post-hoc refinements such as CRFs, and trains efficiently, offering a practical single-stage alternative to multi-stage WSSS pipelines.

Abstract

Class activation maps (CAMs) are commonly employed in weakly supervised semantic segmentation (WSSS) to produce pseudo-labels. Due to incomplete or excessive class activation, existing studies often resort to offline CAM refinement, introducing additional stages or proposing offline modules. This can cause optimization difficulties for single-stage methods and limit generalizability. In this study, we aim to reduce the observed CAM inconsistency and error to mitigate reliance on refinement processes. We propose an end-to-end WSSS model incorporating guided CAMs, wherein our segmentation model is trained while concurrently optimizing CAMs online. Our method, Co-training with Swapping Assignments (CoSA), leverages a dual-stream framework, where one sub-network learns from the swapped assignments generated by the other. We introduce three techniques: i) soft perplexity-based regularization to penalize uncertain regions; ii) a threshold-searching approach to dynamically revise the confidence threshold; and iii) contrastive separation to address the coexistence problem. CoSA demonstrates exceptional performance, achieving mIoU of 76.2% and 51.0% on the VOC and COCO validation datasets, respectively, surpassing existing baselines by a substantial margin. Notably, CoSA is the first single-stage approach to outperform all existing multi-stage methods, including those with additional supervision. Code is available at https://github.com/youshyee/CoSA.

Paper Structure (19 sections, 11 equations, 17 figures, 14 tables, 1 algorithm)

Figures (17)

  • Figure 1: Oracle Experiments on VOC. CAMs are guided by the ground truth (GT), proposed segmentation pseudo-labels (SPL), no guidance (NO) and random noise (NS). (a): classification performance; (b): CAM quality; (c): CAM visualization. All experiments employ a 2k-iteration warm-up before guidance is introduced.
  • Figure 2: Co-training with Swapping Assignments (CoSA). We propose an end-to-end dual-stream weakly-supervised segmentation framework, capable of co-optimizing the segmentation prediction and CAMs by leveraging the swapped assignments, namely CAM pseudo-labels (CPL) and segmentation pseudo-labels (SPL). Our framework comprises two networks: an assignment network (AN) and an online network (ON), where the AN is responsible for generating pseudo-labels for training the ON. While the AN has identical architecture to the ON, it is updated through exponential moving average (EMA) of the ON. The diagram on the right provides an illustration of the architecture. Given weak-augmented images as input, the AN produces CPL to supervise segmentation in the ON ($\mathcal{L}_{\text{c2s}}$). During training, the CPL is softened by reliability-based adaptive weighting (RAW), formed based on CAM perplexity estimation and dynamic thresholding. The AN also generates SPL which is utilized to supervise the CAMs ($\mathcal{L}_{\text{s2c}}$). Further, the CAMs are regularized to contrastively separate the foreground from the background regions ($\mathcal{L}_{\text{csc}}$). Note that the ON is also trained for classification using the image-level class labels ($\mathcal{L}_{\text{cls}}$).
  • Figure 3: CPL Analysis. (a): heatmap of CPL accuracy vs. confidence ranges (x-axis) for different time-steps (y-axis) for VOC and COCO. (b): correlation between perplexity and accuracy of CPL for different time-steps. (c): distribution of CAMs' confidence categorized by the proposed dynamic threshold on VOC. See Supp. for COCO analysis.
  • Figure 4: Illustration of Coexistence Problem in CAMs. The first row shows the input images. The second row shows the coexistence problem, e.g. 'bird' with 'branches', 'train' with 'railways' and 'boat' with 'the sea'.
  • Figure 5: $\mathcal{M}$ and $\mathcal{M}^\dagger$ Comparisons. (a): mIoU vs. time-steps for $\mathcal{M}$ and $\mathcal{M}^\dagger$ on VOC val. (b): same as (a) but filtered by perplexity. (c): cases of coexistence in $\mathcal{M}$ but not in $\mathcal{M}^\dagger$.
  • ...and 12 more figures
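
The Figure 2 caption states that the assignment network (AN) shares the online network's (ON) architecture and is updated as an exponential moving average (EMA) of the ON. The snippet below is a minimal sketch of that kind of update, not the authors' implementation: the function name `ema_update`, the dict-of-lists parameter representation, and the momentum value are all assumptions made for illustration, and the paper's released code would operate on framework tensors instead.

```python
def ema_update(assign_params, online_params, momentum=0.99):
    """Update the assignment-network parameters in place as an EMA of the
    online-network parameters:

        theta_AN <- momentum * theta_AN + (1 - momentum) * theta_ON

    Both arguments are hypothetical stand-ins for model weights: dicts that
    map a parameter name to a flat list of floats.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    for name, online_values in online_params.items():
        assign_params[name] = [
            momentum * a + (1.0 - momentum) * o
            for a, o in zip(assign_params[name], online_values)
        ]
    return assign_params


# Toy usage: the AN drifts slowly toward the ON after each training step.
online = {"w": [1.0, 2.0]}
assign = {"w": [0.0, 0.0]}
ema_update(assign, online, momentum=0.9)
print(assign["w"])  # values close to 0.1 and 0.2
```

A momentum close to 1 makes the AN change slowly, so the pseudo-labels it produces (CPL and SPL) shift more gradually than the ON's own predictions; only the ON receives gradient updates in this scheme.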