PiPa++: Towards Unification of Domain Adaptive Semantic Segmentation via Self-supervised Learning
Mu Chen, Zhedong Zheng, Yi Yang
TL;DR
PiPa++ presents a unified self-supervised framework for unsupervised domain adaptive semantic segmentation that handles both image- and video-level domain shifts by learning intra-domain pixel- and patch-level context and enforcing temporal continuity. It introduces pixel-wise, patch-wise, and temporal contrast losses, combined with task-smart sampling and a semantic-aware memory strategy, to create a robust, parameter-efficient training objective that complements existing UDA methods. Across image and video benchmarks, PiPa++ yields significant improvements over strong baselines, including state-of-the-art results on several GTA→Cityscapes and VIPER-based tasks, while maintaining a compact deployment footprint. This approach offers a practical path toward unified domain adaptation for dense prediction tasks and paves the way for extending self-supervised, contrastive strategies to broader vision problems.
Abstract
Unsupervised domain adaptive segmentation aims to improve the segmentation accuracy of models on target domains without relying on labeled data from those domains. This approach is crucial when labeled target domain data is scarce or unavailable. It seeks to align the feature representations of the source domain (where labeled data is available) and the target domain (where only unlabeled data is present), thus enabling the model to generalize well to the target domain. Image- and video-level domain adaptation are currently addressed with different, specialized frameworks, training strategies, and optimizations, despite their underlying connections. In this paper, we propose a unified framework, PiPa++, which leverages the core idea of "comparing" to (1) explicitly encourage learning of discriminative pixel-wise features with intra-class compactness and inter-class separability, (2) promote robust feature learning of the identical patch against different contexts or fluctuations, and (3) enable the learning of temporal continuity under dynamic environments. With the designed task-smart contrastive sampling strategy, PiPa++ mines more informative training samples according to the task demand. Extensive experiments demonstrate the effectiveness of our method on both image-level and video-level domain adaptation benchmarks. Moreover, the proposed method is compatible with other UDA approaches and further improves their performance without introducing extra parameters.
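To make objective (1) concrete, the following is a minimal sketch of a generic InfoNCE-style pixel-wise contrastive loss, in which pixels of the same class act as positives and all other pixels as negatives. It is an illustration under stated assumptions, not the exact PiPa++ loss, which is not specified in this section: the function name `pixel_contrast_loss`, the `temperature` value, and the use of plain NumPy on a flat set of pixel embeddings are all hypothetical choices.

```python
import numpy as np

def pixel_contrast_loss(features, labels, temperature=0.1):
    """Generic supervised InfoNCE-style loss over pixel embeddings (illustrative).

    features: (N, D) array, one embedding per sampled pixel (N >= 2).
    labels:   (N,) integer class label per pixel.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)

    # L2-normalise so dot products are cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = f @ f.T / temperature

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    same_class = (labels[:, None] == labels[None, :]) & not_self

    # Log-softmax over all other pixels; each pixel is excluded from its own row.
    logits = np.where(not_self, logits, -np.inf)
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Mean log-probability of the positives for each anchor pixel.
    pos_count = same_class.sum(axis=1)
    valid = pos_count > 0  # anchors with at least one same-class pixel
    if not valid.any():
        return 0.0
    pos_log_prob = np.where(same_class, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())
```

Minimising this quantity pulls same-class pixel embeddings together (intra-class compactness) and pushes different-class embeddings apart (inter-class separability). When the labels agree with the cluster structure of the embeddings the loss is close to zero, and it grows when same-class pixels lie far apart.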
