Pixel-Level Domain Adaptation: A New Perspective for Enhancing Weakly Supervised Semantic Segmentation

Ye Du, Zehua Fu, Qingjie Liu

TL;DR

This work tackles weakly supervised semantic segmentation by addressing the imbalanced activation of Class Activation Maps (CAMs) through Pixel-Level Domain Adaptation (PLDA), which aligns discriminative and non-discriminative object regions at the pixel level using a multi-head domain classifier and adversarial training. It further introduces MaskAssign, which dynamically assigns pixels to the source or target domain, and Confident Pseudo-Supervision, which preserves per-pixel discriminability during learning. Training with the combined objective L_total = L_cls + L_uda + L_cps yields consistent improvements on VOC 2012 and COCO 2014 across multiple baselines, in both CAM seed quality and downstream segmentation. The approach is simple and integrates with existing WSSS methods, improving CAM completeness and pseudo-label quality, and it may extend to other CAM-driven tasks.
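
The exact MaskAssign rule is not given in this summary; the Figure 3 caption only states that source and target pixels are assigned dynamically according to CAM values. As a rough illustration of what such an assignment could look like, the sketch below labels pixels with two thresholds. The function name `assign_domains`, the thresholds `high` and `low`, and the extra "ignore" label are all assumptions made for this example, not details from the paper.

```python
def assign_domains(cam, high=0.6, low=0.3):
    """Label each pixel of a normalized CAM (values in [0, 1]).

    Hypothetical rule, for illustration only:
      value >= high        -> "source" (most discriminative pixels)
      low <= value < high  -> "target" (less discriminative object pixels)
      value < low          -> "ignore" (treated as background here)
    """
    labels = []
    for row in cam:
        labels.append([
            "source" if v >= high else "target" if v >= low else "ignore"
            for v in row
        ])
    return labels


# Usage: a 2x2 CAM for one class.
cam = [[0.9, 0.4],
       [0.1, 0.6]]
domains = assign_domains(cam)
```

Because the CAM changes as training progresses, recomputing these labels at every iteration is what would make the assignment dynamic.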

Abstract

Recent attention has been devoted to learning semantic segmentation models exclusively from image tags, a paradigm known as image-level Weakly Supervised Semantic Segmentation (WSSS). Existing attempts adopt Class Activation Maps (CAMs) as priors to mine object regions, yet they suffer from the imbalanced activation issue, where only the most discriminative object parts are located. In this paper, we argue that the distribution discrepancy between the discriminative and the non-discriminative parts of objects prevents the model from producing complete and precise pseudo masks as ground truths. To address this, we propose a Pixel-Level Domain Adaptation (PLDA) method that encourages the model to learn pixel-wise domain-invariant features. Specifically, a multi-head domain classifier trained adversarially with the feature extractor is introduced to promote the emergence of pixel features that are invariant with respect to the shift between the source (i.e., the discriminative object parts) and the target (i.e., the non-discriminative object parts) domains. In addition, we propose a Confident Pseudo-Supervision strategy to guarantee the discriminative ability of each pixel for the segmentation task, which serves as a complement to the intra-image domain adversarial training. Our method is conceptually simple and intuitive, and it can be easily integrated into existing WSSS methods. Taking several strong baseline models as instances, we experimentally demonstrate the effectiveness of our approach under a wide range of settings.
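
Adversarial training of a domain classifier against a feature extractor is commonly implemented with a gradient reversal layer, which the Figure 3 caption names as "GRL". The following is a minimal, framework-free sketch of that idea and not the authors' implementation: the class name `GradientReversal` and the scaling factor `lam` are assumptions introduced for illustration.

```python
class GradientReversal:
    """Identity in the forward pass; flips and scales gradients in the backward pass.

    Placed between the feature extractor and the domain classifier, it lets the
    classifier minimize its domain loss while the extractor receives the
    negated gradient and is therefore pushed toward domain-invariant features.
    """

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical weight on the reversed gradient

    def forward(self, features):
        # Features pass through unchanged.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is multiplied by -lam.
        return [-self.lam * g for g in grad_output]


# Usage: features are untouched going forward, gradients are reversed going back.
grl = GradientReversal(lam=0.5)
out = grl.forward([1.0, -2.0])
grad_to_extractor = grl.backward([0.5, -1.0])
```

In an autograd framework the same behavior would be expressed as a custom function with an identity forward pass and a negated backward pass.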

Paper Structure (38 sections, 9 equations, 10 figures, 11 tables, 1 algorithm)

Figures (10)

  • Figure 1: Illustration of the motivation. The presence of a distribution discrepancy between the discriminative elements (e.g., the cat's head) and the non-discriminative components (e.g., the cat's body) of objects leads to a distinct activation pattern within the class activation map. To counteract this challenge, this paper proposes to explicitly align the pixel features of both types of regions in the image.
  • Figure 2: Illustration of the distribution discrepancy between the most discriminative and less-discriminative regions. The distribution is depicted by plotting the similarity scores between each pixel and its corresponding class centroid, which is obtained by averaging all pixel features within the class. We denote the most discriminative regions as the "source" and the others as the "target". We use the PASCAL VOC 2012 (Everingham et al., 2010) validation set, which consists of $1{,}449$ images, for plotting. To ensure an equal number of pixels in both domains, we sample $256$ pixels per class.
  • Figure 3: Illustration of the PLDA framework. Our method utilizes a domain classifier $g_{\phi}$ that is trained adversarially with the feature extractor $f_{\theta}$ to learn domain-invariant features between the source and target domains. The source and target domain pixels are assigned dynamically according to CAM values. Additionally, we employ a confident pseudo-supervision mechanism on pixels associated with the source (or target) domain to ensure the discriminability essential for the segmentation task, which operates in harmony with the domain adversarial training. "GRL" means gradient reversal layer.
  • Figure 4: Comparison of the two alignment strategies. (a) The global method aligns the marginal distributions without considering per-class information. (b) The category-wise method aligns the class-conditional distributions to reduce mismatches between classes. "$+$" and "$-$" indicate two categories. Green and blue colors indicate samples of the source and target domains, respectively.
  • Figure 5: Illustration of the multi-head domain classifier. The classifier $g_{\phi}$ is instantiated with a shared base network and $C$ classifier heads, where each head (e.g., $h_2$ in this instance) distinguishes source and target pixels for a specific category. This simple design avoids misalignment across classes.
  • ...and 5 more figures
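
The Figure 2 caption describes scoring each pixel by its similarity to a class centroid, where the centroid is the average of all pixel features in the class. A minimal sketch of that computation follows. The caption does not say which similarity measure is used, so cosine similarity is an assumption here, as are the function names `centroid`, `cosine`, and `centroid_similarities`.

```python
import math


def centroid(features):
    """Average a list of equal-length pixel feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors (eps guards against zero norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def centroid_similarities(features):
    """Similarity of every pixel feature of one class to that class's centroid."""
    c = centroid(features)
    return [cosine(f, c) for f in features]


# Usage: two toy 2-D pixel features belonging to the same class.
scores = centroid_similarities([[1.0, 0.0], [0.0, 1.0]])
```

Computing these scores separately for source and target pixels and plotting the two histograms would give a picture of the kind of discrepancy the caption refers to.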