Unified Domain Adaptive Semantic Segmentation

Zhe Zhang, Gaochang Wu, Jing Zhang, Xiatian Zhu, Dacheng Tao, Tianyou Chai

TL;DR

This work tackles unsupervised domain adaptation for semantic segmentation in both image and video settings by proposing Quad-directional Mixup (QuadMix), a unified domain augmentation framework. QuadMix performs four-directional pixel- and feature-level mixing across intra- and inter-domain pathways, generating diverse intermediate domains to bridge domain gaps, complemented by a flow-guided spatio-temporal feature aggregation module for fine-grained alignment in videos. The method supports both CNN and transformer backbones and achieves new state-of-the-art results on four challenging benchmarks, including Synthia-Seq→Cityscapes-Seq and VIPER→Cityscapes-Seq, with strong improvements over prior art in both image and video UDA-SS. Extensive ablations demonstrate the contributions of video patch templates, feature-level mixing, and especially the flow-guided aggregation, along with robust performance under varying hyperparameters and training regimes. The work provides a practical, end-to-end, unified framework that advances cross-domain knowledge transfer for dense prediction tasks in both image and video domains, with potential extensions to broader multi-modal and open-set scenarios.

Abstract

Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer supervision from a labeled source domain to an unlabeled target domain. Most existing UDA-SS works consider images, whilst recent attempts have extended to videos by modeling the temporal dimension. Although the two lines of research share the same major challenge, overcoming the underlying domain distribution shift, their studies are largely independent, resulting in fragmented insights, a lack of holistic understanding, and missed opportunities for cross-pollination of ideas. This fragmentation prevents the unification of methods, leading to redundant efforts and suboptimal knowledge transfer across image and video domains. Motivated by this observation, we advocate unifying the study of UDA-SS across video and image scenarios, enabling a more comprehensive understanding, synergistic advancements, and efficient knowledge sharing. To that end, we explore unified UDA-SS from a general data augmentation perspective, which serves as a unifying conceptual framework, enables improved generalization, and offers potential for cross-pollination of ideas, ultimately contributing to the overall progress and practical impact of this field of research. Specifically, we propose a Quad-directional Mixup (QuadMix) method, characterized by tackling distinct point attributes and feature inconsistencies through four-directional paths for intra- and inter-domain mixing in a feature space. To deal with temporal shifts in videos, we incorporate optical flow-guided feature aggregation across spatial and temporal dimensions for fine-grained domain alignment. Extensive experiments show that our method outperforms state-of-the-art works by large margins on four challenging UDA-SS benchmarks. Our source code and models will be released at https://github.com/ZHE-SAPI/UDASS.
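
To make the idea of optical flow-guided feature aggregation more concrete, the following is a minimal sketch, not the authors' module. It assumes a backward-warping flow convention (each current-frame pixel stores the displacement to its match in the previous frame), nearest-neighbour sampling, and a plain weighted average as the aggregation step. The function names `warp_with_flow` and `aggregate` and the `weight` parameter are hypothetical; the category-aware aggregation described in the paper is not reproduced here.

```python
import numpy as np


def warp_with_flow(prev_feat, flow):
    """Backward-warp prev_feat (H, W, C) into the current frame.

    Assumed convention (not taken from the paper): flow[y, x] = (dx, dy)
    points from pixel (x, y) in the current frame to its matching location
    in the previous frame. Sampling is nearest-neighbour and coordinates
    that fall outside the image are clamped to the border.
    """
    h, w = prev_feat.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_feat[src_y, src_x]


def aggregate(curr_feat, prev_feat, flow, weight=0.5):
    """Blend current features with flow-aligned previous-frame features."""
    warped = warp_with_flow(prev_feat, flow)
    return weight * curr_feat + (1.0 - weight) * warped


# Toy example: a 3x4 single-channel feature map whose value equals the
# column index, and a scene that moves one pixel to the right between frames.
prev_feat = np.tile(np.arange(4.0), (3, 1))[..., None]  # shape (3, 4, 1)
flow = np.zeros((3, 4, 2))
flow[..., 0] = -1.0  # each current pixel came from one column to the left

warped = warp_with_flow(prev_feat, flow)
fused = aggregate(warped, prev_feat, flow)  # current frame equals the warp here
```

In this toy case the warped features line up with the current frame, so the weighted average leaves them unchanged; with real features the blend would smooth predictions along motion trajectories.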

Paper Structure

This paper contains 53 sections, 24 equations, 25 figures, 19 tables, and 1 algorithm.

Figures (25)

  • Figure 1: Instead of studying UDA-SS for images and videos independently, we explore the unified study with a single approach.
  • Figure 2: Overview of the proposed QuadMix for UDA-SS. Image UDA-SS follows a parallel approach, with the exception of temporal cues, as indicated by the dashed lines. (i) In part (a), QuadMix comprises four comprehensive intra/inter-domain mixing paths to bridge domain gaps at spatio-temporal pixel and feature levels. Pixel-level mixing is performed on adjacent frames, optical flow, and labels/pseudo-labels, aiming to generate two enhanced inter-mixed domains that are iteratively derived from intra-mixed domains: $\mathcal{T} \rightarrow (\mathcal{S} \rightarrow \mathcal{S})$ and $\mathcal{S} \rightarrow (\mathcal{T} \rightarrow \mathcal{T})$. These intermediate domains overcome the intra-domain discontinuity within $\mathcal{S}$ and $\mathcal{T}$ and exhibit more generalizable features for gap bridging. Additionally, feature-level mixing across quad-mixed domains alleviates feature inconsistencies caused by distinct domain-wise video contexts; (ii) In part (b), optical flow-guided spatio-temporal feature aggregation compresses video features across domains into a compact category-aware space, minimizing intra-category discrepancies and enhancing inter-category discriminability for the target domain; (iii) The training process is end-to-end. In part (c), stacked adjacent frames $X_t^\mathcal{T}$ and optical flow $o_{t - 1 \to t}^{ \mathcal{T}}$ are needed for target domain testing.
  • Figure 3: Comparison of domain mixing paradigms. Existing paradigms suffer from limitations such as intra-domain discontinuity, less generalizable feature distributions, and feature inconsistencies. Going beyond them, the proposed QuadMix generalizes intra-mixed domains and enhances inter-mixed domains at both the spatial (temporal) pixel level and the feature level. The symbol "*" denotes the sample templates.
  • Figure 4: Examples of various mixing strategies in QuadMix for video UDA-SS: (a) $\mathcal{S}$ and $\mathcal{T}$ (before QuadMix), (b) $\mathcal{S^*}$ and $\mathcal{T^*}$ (source templates: $\textit{person}$ and $\textit{rider}$, target templates: $\textit{sign}$ and $\textit{sky}$), (c) $\mathcal{S} \rightarrow \mathcal{S}$ and $\mathcal{T} \rightarrow \mathcal{T}$ (intra-domain mixing), (d) $\mathcal{S} \rightarrow (\mathcal{T} \rightarrow \mathcal{T})$ and $\mathcal{T} \rightarrow (\mathcal{S} \rightarrow \mathcal{S})$ (further with inter-domain mixing, i.e. after QuadMix). The effects of these strategies on video frames, optical flow, and labels/pseudo-labels are illustrated. We present $x_t^{\mathcal{T} \rightarrow (\mathcal{S}\rightarrow \mathcal{S})}$ and $x_t^{\mathcal{S} \rightarrow (\mathcal{T}\rightarrow \mathcal{T})}$ without masks for better understanding. Notably, the patch templates required in training iteration $n$ are generated online adaptively from iteration $n-1$. Please zoom in for details.
  • Figure 5: Details of the quad-directional mixing for UDA-SS. To alleviate intra-domain discontinuity for comprehensive domain gap bridging, we conduct QuadMix at the spatial level (adjacent video frames and labels), temporal level (optical flow), and spatio-temporal feature level, constructing enhanced inter-mixed source $\mathcal{T} \rightarrow (\mathcal{S} \rightarrow \mathcal{S})$ and inter-mixed target $\mathcal{S} \rightarrow (\mathcal{T} \rightarrow \mathcal{T})$ domains that are derived from the intra-mixed domains $\mathcal{S} \rightarrow \mathcal{S}$ and $\mathcal{T} \rightarrow \mathcal{T}$ with generalized feature spaces. The mask $M_t^{\mathcal{D}^*}$ denotes the union of the source patch template mask $M_t^{\mathcal{S}^*}$ and the target patch template mask $M_t^{\mathcal{T}^*}$.
  • ...and 20 more figures
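
The pixel-level mixing described in the captions for Figures 4 and 5 can be illustrated with a small mask-based pasting sketch. This is a simplified illustration under stated assumptions, not the authors' code: the helper name `paste_templates`, the toy class IDs, and the rule that the source template wins where the two masks overlap are all hypothetical. The one element taken from the captions is that the pasted region is the union of the source and target patch template masks, and that the same masks are applied to frames, optical flow, and labels.

```python
import numpy as np


def _expand(mask, ref):
    """Add trailing axes so a (H, W) mask broadcasts against ref."""
    mask = np.asarray(mask, dtype=bool)
    while mask.ndim < ref.ndim:
        mask = mask[..., None]
    return mask


def paste_templates(base, src_tpl, tgt_tpl, mask_src, mask_tgt):
    """Paste source and target patch templates onto base.

    Pixels inside mask_tgt come from tgt_tpl, pixels inside mask_src come
    from src_tpl (the source template wins on overlap, which is an
    assumption of this sketch), and all other pixels keep the base values.
    The modified region is therefore the union of the two masks.
    """
    out = np.where(_expand(mask_tgt, base), tgt_tpl, base)
    return np.where(_expand(mask_src, base), src_tpl, out)


h, w = 4, 4
mask_src = np.zeros((h, w), dtype=bool)
mask_src[:2, :2] = True   # hypothetical source template region
mask_tgt = np.zeros((h, w), dtype=bool)
mask_tgt[2:, 2:] = True   # hypothetical target template region
union = np.logical_or(mask_src, mask_tgt)

# The same masks are applied to a frame (H, W, 3), optical flow (H, W, 2),
# and a label map (H, W), so that all three stay spatially consistent.
frame = np.zeros((h, w, 3))
mixed_frame = paste_templates(frame, np.full((h, w, 3), 1.0),
                              np.full((h, w, 3), 2.0), mask_src, mask_tgt)

flow = np.zeros((h, w, 2))
mixed_flow = paste_templates(flow, np.full((h, w, 2), 0.5),
                             np.full((h, w, 2), -0.5), mask_src, mask_tgt)

label = np.zeros((h, w), dtype=int)          # toy class IDs, not dataset IDs
mixed_label = paste_templates(label, np.full((h, w), 11),
                              np.full((h, w), 10), mask_src, mask_tgt)
```

Applying one shared mask to every modality is the design point here: if frames, flow, and labels were mixed with different masks, the pasted pixels would no longer agree with their motion or supervision.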