PiPa++: Towards Unification of Domain Adaptive Semantic Segmentation via Self-supervised Learning

Mu Chen, Zhedong Zheng, Yi Yang

TL;DR

PiPa++ presents a unified self-supervised framework for unsupervised domain adaptive semantic segmentation that handles both image- and video-level domain shifts by learning intra-domain pixel- and patch-level context and enforcing temporal continuity. It introduces pixel-wise, patch-wise, and temporal contrast losses, combined with task-smart sampling and a semantic-aware memory strategy, to create a robust, parameter-efficient training objective that complements existing UDA methods. Across image and video benchmarks, PiPa++ yields significant improvements over strong baselines, including state-of-the-art results on several GTA→Cityscapes and VIPER-based tasks, while maintaining a compact deployment footprint. This approach offers a practical path toward unified domain adaptation for dense prediction tasks and paves the way for extending self-supervised, contrastive strategies to broader vision problems.

Abstract

Unsupervised domain adaptive segmentation aims to improve the segmentation accuracy of models on target domains without relying on labeled data from those domains. This approach is crucial when labeled target domain data is scarce or unavailable. It seeks to align the feature representations of the source domain (where labeled data is available) and the target domain (where only unlabeled data is present), thus enabling the model to generalize well to the target domain. Image- and video-level domain adaptation are currently addressed with different, specialized frameworks, training strategies, and optimizations, despite their underlying connections. In this paper, we propose a unified framework, PiPa++, which leverages the core idea of "comparing" to (1) explicitly encourage learning of discriminative pixel-wise features with intra-class compactness and inter-class separability, (2) promote robust feature learning of the identical patch against different contexts or fluctuations, and (3) enable the learning of temporal continuity under dynamic environments. With the designed task-smart contrastive sampling strategy, PiPa++ enables the mining of more informative training samples according to the task demand. Extensive experiments demonstrate the effectiveness of our method on both image-level and video-level domain adaptation benchmarks. Moreover, the proposed method is compatible with other UDA approaches and further improves their performance without introducing extra parameters.
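
To make point (1) of the abstract more concrete, the following is a minimal sketch of a label-guided pixel contrast in the InfoNCE style: embeddings of pixels that share a class are pulled together, and embeddings of other classes are pushed away. The function name `pixel_contrast_loss`, the temperature of 0.1, and the pure-Python list representation are assumptions made for illustration; this is not the authors' implementation and omits their sampling and memory strategies.

```python
import math

def _normalize(v):
    # L2-normalize a vector so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def pixel_contrast_loss(embeddings, labels, temperature=0.1):
    """Label-guided InfoNCE over sampled pixel embeddings (illustrative).

    embeddings: list of feature vectors, one per sampled pixel.
    labels: list of class ids (ground truth or pseudo labels).
    temperature: assumed value, not taken from the paper.
    """
    feats = [_normalize(e) for e in embeddings]
    n = len(feats)
    total, count = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity of anchor i to every other pixel.
        sims = {j: sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
                for j in range(n) if j != i}
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives or len(positives) == len(sims):
            continue  # an anchor needs at least one positive and one negative
        neg_sum = sum(math.exp(s) for j, s in sims.items()
                      if labels[j] != labels[i])
        for j in positives:
            pos = math.exp(sims[j])
            total += -math.log(pos / (pos + neg_sum))
            count += 1
    return total / count if count else 0.0

# Same-class pixels that are already close yield a lower loss than
# same-class pixels that are far apart.
compact = pixel_contrast_loss([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], [0, 0, 1, 1])
scattered = pixel_contrast_loss([[1, 0], [0, 1], [0.9, 0.1], [0.1, 0.9]], [0, 0, 1, 1])
```

In this sketch the loss is averaged over all anchor-positive pairs, so minimizing it favors intra-class compactness and inter-class separability in the embedding space.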

Paper Structure (11 sections, 7 equations, 5 figures, 13 tables)

Figures (5)

  • Figure 1: Different from existing works, we focus on mining intra-domain knowledge, and argue that the contextual structure between pixels and patches can help the model learn domain-invariant knowledge in a self-supervised manner. In addition, we exploit the change of contextual structures across frames in a video and maintain contextual consistency across frames, thereby achieving temporal continuity. Our design is efficient and flexible, making it well suited to both image-based and video-based domain adaptation tasks, encompassing four competitive benchmarks: GTA [richter2016playing] $\rightarrow$ Cityscapes [MariusCordts2016TheCD], SYNTHIA [GermanRos2016TheSD] $\rightarrow$ Cityscapes, SYNTHIA-Seq $\rightarrow$ Cityscapes-Seq, and VIPER [richter2017playing] $\rightarrow$ Cityscapes-Seq. Albeit simple, the proposed learning method is compatible with other existing methods and can further boost their performance.
  • Figure 2: A brief illustration of our unified multi-grained self-supervised learning framework (PiPa). Given the labeled source data $\left\{\left(x^{S}, y^{S}\right)\right\}$, we calculate the segmentation prediction $\hat{y}^S$ with the backbone $g_\theta$ and the classification head $h_{cls}$, supervised by the basic segmentation loss $L_{ce}^S$. During training, we leverage the moving-averaged model $\boldsymbol{g}_{\bar{\theta}}$ to estimate the pseudo label $\bar{y}^T$ and craft the mixed label $\bar{y}^{Mix}$ based on the category. According to the mixed label, we copy the corresponding regions to form the mixed data $x^{Mix}$. We also deploy the model $g_\theta$ and the head $h_{cls}$ to obtain the mixed prediction $\hat{y}^{Mix}$, supervised by $L_{ce}^T$. Beyond these basic segmentation losses, we revisit current pixel contrast and propose a unified multi-grained contrast. In (a), we regularize the pixel embedding space by computing pixel-to-pixel contrast: pulling positive-pair embeddings closer and pushing negative embeddings away. In (b), we regularize the patch-wise consistency between the projected patches $\mathbf{O}_1$ and $\mathbf{O}_2$. Similarly, we harness the patch-wise contrast, which pulls positive pairs, i.e., two features at the same location of $\mathbf{O}_1$ and $\mathbf{O}_2$, closer, while pushing negative pairs, i.e., any two features in $\mathbf{M}_1 \cup \mathbf{M}_2$ at different locations, apart. During inference, we drop the two projection heads $h_{patch}$ and $h_{pixel}$ and keep only $g_\theta$ and $h_{cls}$.
  • Figure 3: Our proposed training framework: (a) encourages intra-class compactness and inter-class dispersion by pulling pixel-wise intra-class features closer and pushing inter-class features away within the image (see a&b in the top row); (b) maintains local patch consistency against different contexts, such as the yellow local patch in the green and the blue patches (see the middle row); and (c) aggregates temporal continuity through contrastive learning across frames.
  • Figure 4: Qualitative results on GTA $\rightarrow$ Cityscapes and SYNTHIA $\rightarrow$ Cityscapes. From left to right: Target Image, Ground Truth, and the results predicted by DAFormer, DAFormer + Ours (PiPa), HRDA, and HRDA + Ours (PiPa). White dashed boxes highlight the regions where the predictions differ.
  • Figure 5: Qualitative results on VIPER $\rightarrow$ Cityscapes-Seq and SYNTHIA-Seq $\rightarrow$ Cityscapes-Seq. From left to right: Target Image, Ground Truth, and the results predicted by DAFormer, DAFormer + Ours (PiPa), HRDA, and HRDA + Ours (PiPa). White dashed boxes highlight the regions where the predictions differ.
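
The self-training steps described in the Figure 2 caption (a moving-averaged teacher that produces pseudo labels, and category-based mixing of source regions into target images) can be sketched as follows. This is a hedged illustration only: the function names `ema_update` and `class_mix`, the momentum value, and the nested-list image format are assumptions, and the mixing follows the general ClassMix-style recipe rather than the authors' exact code.

```python
def ema_update(teacher, student, momentum=0.999):
    """Move each teacher weight toward the student weight.

    teacher, student: dicts mapping parameter names to floats
    (a stand-in for real weight tensors). momentum is an assumed value.
    """
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

def class_mix(x_src, y_src, x_tgt, y_tgt_pseudo, classes):
    """Paste source pixels of the selected classes onto the target image.

    Images and label maps are nested lists of equal shape (rows of pixels).
    Returns the mixed image and the mixed label map.
    """
    x_mix, y_mix = [], []
    for xs_row, ys_row, xt_row, yt_row in zip(x_src, y_src, x_tgt, y_tgt_pseudo):
        # Mask is True where the source pixel belongs to a selected class.
        mask = [label in classes for label in ys_row]
        x_mix.append([xs if m else xt for xs, xt, m in zip(xs_row, xt_row, mask)])
        y_mix.append([ys if m else yt for ys, yt, m in zip(ys_row, yt_row, mask)])
    return x_mix, y_mix

# Toy usage: copy class-1 source pixels into a target image whose
# pseudo labels are all class 2.
teacher = ema_update({"w": 1.0}, {"w": 0.0}, momentum=0.9)
x_mix, y_mix = class_mix(
    x_src=[[1, 1], [1, 1]], y_src=[[0, 1], [1, 0]],
    x_tgt=[[9, 9], [9, 9]], y_tgt_pseudo=[[2, 2], [2, 2]],
    classes={1},
)
```

The mixed label keeps source ground truth where source pixels were pasted and teacher pseudo labels elsewhere, which matches the caption's description of crafting the mixed label based on the category.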