MoDA: Leveraging Motion Priors from Videos for Advancing Unsupervised Domain Adaptation in Semantic Segmentation

Fei Pan; Xu Yin; Seokju Lee; Axi Niu; Sungeui Yoon; In So Kweon

TL;DR

MoDA tackles unsupervised domain adaptation for semantic segmentation when the target domain provides unlabeled video frames. It leverages self-supervised object motion cues learned from target videos and uses them through two modules, the Object Discovery Module and the Semantic Mining Module, to refine pseudo labels and improve self-training. The approach disentangles object motion from ego-motion via geometric constraints and demonstrates superior performance over optical-flow baselines in both domain-adaptive video and image segmentation, while remaining compatible with other UDA methods. This motion-guided framework offers a practical way to exploit unlabeled video data for domain adaptation in segmentation.

Abstract

Unsupervised domain adaptation (UDA) has been a potent technique to handle the lack of annotations in the target domain, particularly in the semantic segmentation task. This study introduces a different UDA scenario where the target domain contains unlabeled video frames. Drawing upon recent advancements in self-supervised learning of object motion from unlabeled videos with geometric constraints, we design a Motion-guided Domain Adaptive semantic segmentation framework (MoDA). MoDA harnesses the self-supervised object motion cues to facilitate cross-domain alignment for the segmentation task. First, we present an object discovery module to localize and segment target moving objects using object motion information. Then, we propose a semantic mining module that takes the object masks to refine the pseudo labels in the target domain. Subsequently, these high-quality pseudo labels are used in the self-training loop to bridge the cross-domain gap. On domain adaptive video and image segmentation experiments, MoDA shows the effectiveness of utilizing object motion as guidance for domain alignment compared with optical flow information. Moreover, MoDA exhibits versatility as it can complement existing state-of-the-art UDA approaches. Code at https://github.com/feipanir/MoDA.

Paper Structure (14 sections, 13 equations, 9 figures, 4 tables)

Introduction
Related Works
Preliminary
Target Domain Geometric Learning
Motion Mask Preprocessing
Methodology
Segmentation Warm-up
Target Object Discovery
Target Semantic Mining
Experiments
Experiment Setup
Evaluation Results
Ablation Study
Conclusion

Figures (9)

Figure 1: Current UDA methods show notable performance for background categories (e.g., tree, road, sky), but their performance is limited in real-world dynamic scenes containing multiple moving objects (e.g., buses). We propose MoDA, which uses object motion from unlabeled videos as complementary guidance to refine pseudo labels (PL) in the target domain. Note that the object motion is learned using self-supervised geometric constraints from sequential video frames ($t_1$ and $t_2$), without requiring any annotations.
Figure 2: The object motion information is learned by self-supervised geometric constraints from unlabeled target video frames, without any annotations. (a) The visualization of a bird's eye view of the dynamic scene, where the yellow bus is moving toward the camera. We indicate the object motion of the yellow bus and the ego motion from the camera itself. (b) The diagram for geometric training to learn the object motion from a pair of target adjacent video frames. The motion network and depth network are trained by the self-supervised losses (photometric loss and regularization losses) following geometric constraints.
Figure 3: (a) The motion mask extracted from the object motion map includes multiple moving instances. Therefore, we adopt connected component labeling to identify each moving instance. (b) The object motion map is in 3D space ($x$-, $y$-, and $z$-axes). Object motion is capable of capturing the motion pattern along the $z$-axis (moving forward/backward), such as that of this vehicle. In contrast, optical flow, which lies in 2D space, fails to capture the motion of this vehicle.
Figure 4: The instance-level motion mask might contain multiple moving objects bound together such as the rider and the motorcycle. The object discovery module takes an instance-level motion mask as input and predicts accurate object masks. Specifically, given a target image and its instance-level motion masks, we compute an objectness score map by computing the similarity of each query with all the keys in Eq. \ref{eq:cosine_similarity} and the processing in Eq. \ref{eq:norm_rank_nms}.
Figure 5: Directly utilizing instance-level motion masks might be sub-optimal as they are coarse masks for moving objects. The object discovery module is proposed to extract accurate object masks from coarse instance-level motion masks. The semantic mining module takes the moving object masks as guidance to refine the target pseudo labels.
...and 4 more figures
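
The caption of Figure 3(a) mentions connected component labeling as the step that splits a motion mask into individual moving instances. The following is a minimal sketch of that general technique, not the authors' implementation: the function name `label_components`, the choice of 4-connectivity, and the toy `motion_mask` are all assumptions made for illustration.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected foreground regions of a binary mask.

    mask: list of rows, each a list of 0/1 values.
    Returns (labels, count): labels has the same shape as mask, with 0 for
    background and 1..count for each connected region.
    Note: 4-connectivity is an assumption; the source does not specify it.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            # Skip background pixels and pixels that already have a label.
            if not mask[r][c] or labels[r][c]:
                continue
            # Start a new region and flood-fill it breadth-first.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Hypothetical toy motion mask containing two separate moving regions.
motion_mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
instance_labels, num_instances = label_components(motion_mask)
```

Each distinct label in `instance_labels` would then correspond to one instance-level motion mask. As Figure 4 notes, such a mask can still bind several objects together (for example a rider and a motorcycle), which is why a further object discovery step is described.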
