Table of Contents
Fetching ...

C^2DA: Contrastive and Context-aware Domain Adaptive Semantic Segmentation

Md. Al-Masrur Khan, Zheng Chen, Lantao Liu

TL;DR

This work proposes a UDA-SS framework that learns both intra-domain and context-aware knowledge and adapts the Mask Image Modeling (MIM) technique to properly use context clues for robust visual recognition, using limited information about the masked images.

Abstract

Unsupervised domain adaptive semantic segmentation (UDA-SS) aims to train a model on the source domain data (e.g., synthetic) and adapt the model to predict target domain data (e.g., real-world) without accessing target annotation data. Most existing UDA-SS methods only focus on inter-domain knowledge to mitigate the data-shift problem. However, learning the inherent structure of the images and exploring the intrinsic pixel distribution of both domains are ignored, which prevents the UDA-SS methods from producing satisfactory performance like supervised learning. Moreover, incorporating contextual knowledge is also often overlooked. Considering these issues, in this work, we propose a UDA-SS framework that learns both intra-domain and context-aware knowledge. To learn the intra-domain knowledge, we incorporate contrastive loss in both domains, which pulls pixels of similar classes together and pushes the rest away, facilitating intra-image-pixel-wise correlations. To learn context-aware knowledge, we modify the mixing technique by leveraging contextual dependency among the classes. Moreover, we adapt the Mask Image Modeling (MIM) technique to properly use context clues for robust visual recognition, using limited information about the masked images. Comprehensive experiments validate that our proposed method improves the state-of-the-art UDA-SS methods by a margin of 0.51% mIoU and 0.54% mIoU in the adaptation of GTA-V->Cityscapes and Synthia->Cityscapes, respectively. We open-source our C2DA code. Code link: github.com/Masrur02/C-Squared-DA

C^2DA: Contrastive and Context-aware Domain Adaptive Semantic Segmentation

TL;DR

This work proposes a UDA-SS framework that learns both intra-domain and context-aware knowledge and adapts the Mask Image Modeling (MIM) technique to properly use context clues for robust visual recognition, using limited information about the masked images.

Abstract

Unsupervised domain adaptive semantic segmentation (UDA-SS) aims to train a model on the source domain data (e.g., synthetic) and adapt the model to predict target domain data (e.g., real-world) without accessing target annotation data. Most existing UDA-SS methods only focus on inter-domain knowledge to mitigate the data-shift problem. However, learning the inherent structure of the images and exploring the intrinsic pixel distribution of both domains are ignored, which prevents the UDA-SS methods from producing satisfactory performance like supervised learning. Moreover, incorporating contextual knowledge is also often overlooked. Considering these issues, in this work, we propose a UDA-SS framework that learns both intra-domain and context-aware knowledge. To learn the intra-domain knowledge, we incorporate contrastive loss in both domains, which pulls pixels of similar classes together and pushes the rest away, facilitating intra-image-pixel-wise correlations. To learn context-aware knowledge, we modify the mixing technique by leveraging contextual dependency among the classes. Moreover, we adapt the Mask Image Modeling (MIM) technique to properly use context clues for robust visual recognition, using limited information about the masked images. Comprehensive experiments validate that our proposed method improves the state-of-the-art UDA-SS methods by a margin of 0.51% mIoU and 0.54% mIoU in the adaptation of GTA-V->Cityscapes and Synthia->Cityscapes, respectively. We open-source our C2DA code. Code link: github.com/Masrur02/C-Squared-DA

Paper Structure

This paper contains 17 sections, 11 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Robots are usually deployed to an environment that has a non-trivial domain shift from the source data, where the human's label/knowledge is provided. Our work is to support this kind of transfer learning tasks.
  • Figure 2: Framework overview of C$^2$DA. Given labeled source data {$x^S$, $y^S$}, we first calculate the source prediction $\hat{y}^{S}$ by using the student model. Later, we leverage the teacher model to predict pseudo-label $\bar{y}^{T}$. We craft the mixed label $y^{mix}$ and mixed data $x^{mix}$ by blending the images from both domains. We use the student model to predict mix prediction $\hat{y}^{mix}$. We also do the masking on target images to generate masked images $x^{ma}$ and leverage the student model to predict masked prediction images $\hat{y}^{ma}$ for learning contextual relations. Except for the segmentation losses we also use contrastive loss $L_{pix}$ for ensuring intra-class compactness and inter-class separability.
  • Figure 3: Illustration of the contextual advantage of the Prior-guided classmix over the conventional classmix.
  • Figure 4: Qualitative Results for the adaptation of GTA-V → Cityscapes.
  • Figure 5: Qualitative Results for the adaptation of RUGD → MESH.
  • ...and 1 more figures