Spatial Structure Constraints for Weakly Supervised Semantic Segmentation

Tao Chen, Yazhou Yao, Xingguo Huang, Zechao Li, Liqiang Nie, Jinhui Tang

TL;DR

This work tackles weakly supervised semantic segmentation from image-level labels alone, addressing the tendency of class activation maps (CAMs) to locate only the most discriminative parts of an object. It introduces Spatial Structure Constraints (SSC), which combine a CAM-driven reconstruction module trained with a perceptual loss and an activation self-modulation module guided by superpixels to enforce regional consistency and refine spatial details. The training objective is $L = L_{cls} + \beta_p L_p + \beta_a L_a$, which jointly optimizes classification, structure-preserving reconstruction, and attention alignment. On PASCAL VOC 2012 and COCO, the method reaches $72.7\%$ and $47.0\%$ mIoU respectively without external saliency models, indicating improved object localization and more complete segmentation under weak supervision.
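To make the weighted objective concrete, here is a minimal sketch of how the three terms combine. The function name `total_loss` and the numeric loss values and weights are hypothetical placeholders chosen for illustration; the summary does not state the values of $\beta_p$ and $\beta_a$.

```python
def total_loss(l_cls, l_p, l_a, beta_p, beta_a):
    """Combine the three terms as L = L_cls + beta_p * L_p + beta_a * L_a.

    l_cls, l_p, l_a are scalar loss values; beta_p and beta_a are the
    weights of the second and third terms (values here are assumptions).
    """
    return l_cls + beta_p * l_p + beta_a * l_a


# Illustrative numbers only, not taken from the paper.
example = total_loss(l_cls=1.0, l_p=2.0, l_a=3.0, beta_p=0.5, beta_a=0.1)
print(example)
```

Setting either weight to zero drops the corresponding constraint, which is one way an ablation over the two auxiliary terms could be expressed.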

Abstract

The image-level label has prevailed in weakly supervised semantic segmentation tasks due to its easy availability. Since image-level labels can only indicate the existence or absence of specific categories of objects, visualization-based techniques have been widely adopted to provide object location clues. Because class activation maps (CAMs) can only locate the most discriminative part of objects, recent approaches usually adopt an expansion strategy to enlarge the activation area for more integral object localization. However, without proper constraints, the expanded activation will easily intrude into the background region. In this paper, we propose spatial structure constraints (SSC) for weakly supervised semantic segmentation to alleviate the unwanted object over-activation of attention expansion. Specifically, we propose a CAM-driven reconstruction module to directly reconstruct the input image from deep CAM features, which constrains the diffusion of last-layer object attention by preserving the coarse spatial structure of the image content. Moreover, we propose an activation self-modulation module to refine CAMs with finer spatial structure details by enhancing regional consistency. Without external saliency models to provide background clues, our approach achieves 72.7% and 47.0% mIoU on the PASCAL VOC 2012 and COCO datasets, respectively, demonstrating the superiority of our proposed approach.
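As a rough illustration of what "regional consistency" can mean for an activation map, the sketch below averages activations inside each superpixel so that all pixels of a region share one value. This is an assumed, simplified stand-in and not the paper's activation self-modulation module; the function name `superpixel_average`, the toy activation map, and the toy segment labels are all hypothetical.

```python
from collections import defaultdict


def superpixel_average(cam, segments):
    """Replace each activation with the mean activation of its superpixel.

    cam      : 2D list of floats (a toy activation map).
    segments : 2D list of ints with the same shape, one region label per pixel.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    # Accumulate the total activation and pixel count of every region.
    for cam_row, seg_row in zip(cam, segments):
        for value, label in zip(cam_row, seg_row):
            sums[label] += value
            counts[label] += 1
    means = {label: sums[label] / counts[label] for label in sums}
    # Write each region's mean back to all of its pixels.
    return [[means[label] for label in seg_row] for seg_row in segments]


# Toy 2x2 example: the top row is one region, the bottom row another.
cam = [[0.9, 0.1], [0.5, 0.3]]
segments = [[0, 0], [1, 1]]
smoothed = superpixel_average(cam, segments)
```

In this toy case the strongly activated pixel (0.9) shares its response with its weakly activated neighbour in the same region, which is the intuition behind using region-level structure to fill in under-activated object parts.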

Paper Structure (17 sections, 10 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Comparison between the traditional methods and ours. (a) Input image. (b) Localization maps produced by CAM [zhou2016learning] only identify the most discriminative part of the object, e.g., the window of a car. (c) CAM expansion results of traditional methods [kim2021discriminative]. These methods mainly focus on expanding the activation region and rely on saliency maps to provide background clues. Consequently, they inevitably cause over-activation, i.e., the expanded object activation intrudes into the background area. (d) Our results. Our proposed spatial structure constraints can constrain the activation within the object area to alleviate object over-activation (first and second rows) and help activate more integral object regions to mitigate object under-activation (last row). Best viewed in color.
  • Figure 2: The architecture of our proposed approach. While training the classification network with the image-level labels, we propose a CAM-driven reconstruction module to reconstruct the input image from its CAM-related features. Moreover, we propose an activation self-modulation module to further refine CAMs with finer spatial structure details through enhancing regional consistency. Our proposed modules help the classification network learn to preserve the spatial structure of the image content and constrain high activation within the object area. $\otimes$ is the Hadamard product. Best viewed in color.
  • Figure 3: Example prediction maps on the PASCAL VOC 2012 val set. For each (a) image, we show (b) the ground truth (GT), predictions of (c) baseline, (d) DRS [kim2021discriminative], (e) DRS + CDR, and (f) DRS + CDR + ASM. Best viewed in color.
  • Figure 4: Example prediction maps on the COCO val set. For each (a) image, we show (b) the ground truth (GT), and (c) prediction. Best viewed in color.
  • Figure 5: Example localization maps on the PASCAL VOC 2012 training set. For each (a) image, we show (b) the ground truth (GT), localization maps produced by (c) baseline, (d) DRS [kim2021discriminative], (e) DRS + CDR, and (f) DRS + CDR + ASM. Best viewed in color.
  • ...and 3 more figures