Fine-grained Background Representation for Weakly Supervised Semantic Segmentation

Xu Yin, Woobin Im, Dongbo Min, Yuchi Huo, Fei Pan, Sung-Eui Yoon

TL;DR

A simple fine-grained background representation (FBR) method that discovers and represents diverse background (BG) semantics to address the co-occurring problem, and that also delivers meaningful performance gains in weakly supervised instance segmentation (WSIS) tasks.

Abstract

Generating reliable pseudo masks from image-level labels is challenging in the weakly supervised semantic segmentation (WSSS) task because such labels carry no spatial information. Prevalent class activation map (CAM)-based solutions struggle to discriminate the foreground (FG) objects from confusing background (BG) pixels (the co-occurring problem) and to learn the integral object regions. This paper proposes a simple fine-grained background representation (FBR) method to discover and represent diverse BG semantics and address the co-occurring problem. We abandon the class prototype and pixel-level features for BG representation. Instead, we develop a novel primitive, the negative region of interest (NROI), to capture fine-grained BG semantic information, and we conduct pixel-to-NROI contrast to distinguish the confusing BG pixels. We also present an active sampling strategy that mines FG negatives on the fly, enabling efficient pixel-to-pixel intra-foreground contrastive learning that activates the entire object region. Thanks to its simple design and ease of use, the proposed method can be plugged into various models, yielding new state-of-the-art results under various WSSS settings across benchmarks. With image-level (I) labels as the only supervision, our method achieves 73.2 mIoU and 45.6 mIoU on the PASCAL VOC and MS COCO test sets, respectively. With saliency maps as an additional supervision signal (I+S), we attain 74.9 mIoU on the PASCAL VOC test set. Our FBR approach also demonstrates meaningful performance gains in weakly supervised instance segmentation (WSIS) tasks, showcasing its robustness and strong generalization capabilities across diverse domains.
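The results above are reported in mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function name `mean_iou`, the flattened label lists, and the choice to skip classes absent from both prediction and ground truth are illustrative assumptions; benchmark toolkits typically accumulate a confusion matrix over the whole dataset, and this section does not describe the paper's exact evaluation protocol.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two flattened label sequences.

    Illustrative sketch only: classes that appear in neither sequence
    are skipped (an assumption), and the result is a fraction in [0, 1]
    rather than a percentage.
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: 4 pixels, 2 classes (0 = background, 1 = object).
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. Multiplying by 100 gives the percentage scale in which figures such as 73.2 are usually quoted.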

Paper Structure

This paper contains 17 sections, 11 equations, 17 figures, and 15 tables.

Figures (17)

  • Figure 1: (a) Input, (b) class activation maps via AMN, (c) refined class activation maps with our method (on AMN). In the 1st row (b), the class activation maps mistake the lake (a co-occurring background semantic) for the boat; in the 2nd row (b), the horse is not completely activated.
  • Figure 2: Architecture overview. A standard feature encoder trained with the classification loss $L_{cls}$ (with TAP) takes an input image $\mathbf{x}$ and generates the seed $\mathcal{H}$. Because the image BG has a different semantic granularity from the FG, we add two projection heads, $\varphi_{fg}$ and $\varphi_{bg}$, to model BG independently from FG and capture diverse BG information, and we optimize two contrastive relationships: (1) fore-to-background and (2) intra-foreground. (1) enhances the semantic features $f$ in representing BG semantics with the proposed fine-grained primitive, namely NROIs. We compute FG prototypes and store NROIs in a memory bank. In addition, an auxiliary BG segmentation loss $L_{seg}$ is introduced. In (2), we present an active sampling strategy built upon the semantic graph to draw the FG negatives. The contrastive losses $L_{pcl}^{bg}$ for (1) and $L_{pcl}^{fg}$ for (2) pull the query closer to its prototype but push it away from the BG and the FG negative keys, respectively.
  • Figure 3: Conceptual illustration of negative-region-of-interest (NROI) for the FB contrast. The brute-force strategy (a) exhaustively compares FG queries (the red cropped part) with all BG pixels (triangles), which requires expensive computational resources and is susceptible to implausible labels. By contrast (b), we propose recognizing the fine-grained BG semantic, i.e., NROI. This example's NROIs (marked with different colors) contain the washing machine, closet, etc. In training, we implement FB contrastive learning by comparing queries (the red rectangle) against NROIs.
  • Figure 4: Example CAM results on the PASCAL VOC 2012 train set. From left to right: input images, results of AMN, results of AMN w/ ours, results of PPC, results of PPC w/ ours, and the ground truth. The red boxes highlight the refined details.
  • Figure 5: Qualitative semantic segmentation results. The left figures are from the PASCAL VOC 2012 val set, and the right ones are from the MS COCO 2014 val set. (a) Input images, (b) Ours, (c) Ground truth.
  • ...and 12 more figures
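
The Figure 2 caption describes contrastive losses that pull a query toward its prototype and push it away from negative keys. A minimal sketch of that pull/push behaviour is given below, assuming an InfoNCE-style objective with cosine similarity and a temperature. The function names, the temperature value, and the use of plain Python lists are illustrative assumptions; the exact form of $L_{pcl}^{bg}$ and $L_{pcl}^{fg}$ is not given in this section, and here the negatives simply stand in for NROI features (for the fore-to-background case) or sampled FG negatives (for the intra-foreground case).

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrast_loss(query, prototype, negatives, temperature=0.1):
    """InfoNCE-style loss for one query (assumed form, not the paper's).

    The prototype is the single positive key; every vector in
    `negatives` is a negative key. The loss is small when the query is
    close to the prototype and far from all negatives.
    """
    q = normalize(query)
    # Logit 0 is the positive pair; the rest are negative pairs.
    logits = [dot(q, normalize(prototype)) / temperature]
    logits += [dot(q, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all keys.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy 2-D example: the prototype points along +x, negatives elsewhere.
prototype = [1.0, 0.0]
negatives = [[-1.0, 0.0], [0.0, 1.0]]
aligned = contrast_loss([0.9, 0.1], prototype, negatives)
confused = contrast_loss([-0.9, 0.1], prototype, negatives)
```

In this toy setup `aligned` is much smaller than `confused`, because the second query sits next to a negative key instead of the prototype. Minimizing such a loss over many queries is what produces the pull toward the prototype and the push away from the negatives.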