Multi-Target Unsupervised Domain Adaptation for Semantic Segmentation without External Data

Yonghao Xu; Pedram Ghamisi; Yannis Avrithis

TL;DR

The paper tackles cross-domain semantic segmentation when multiple target domains are present but external data sharing is restricted. It presents MT-KD, a multi-target knowledge distillation framework that transfers knowledge from a labeled source and multiple targets to an adaptive student via supervised, consistency, and adversarial losses, optimized as a min-max objective $\min_{F_S} \max_{D_{\textrm{out}}} \mathcal{L}_{\textsc{ce}} + \lambda_{\textrm{con}} \mathcal{L}_{\textrm{con}} + \lambda_{\textrm{out}} \mathcal{L}_{\textrm{out}}$. It adds UT-KD to rapidly adapt to unseen targets without external data using self-distillation and one-way adversarial learning with a frozen discriminator. An MT-STN module reduces cross-domain appearance gaps; on GTA5, CityScapes, IDD, and Mapillary, the approach achieves state-of-the-art results in synthetic-to-real and real-to-real transfers, with UT-KD offering practical, privacy-friendly adaptation.
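
As a reading aid, the student side of the objective above can be written as a plain weighted sum. This is a minimal sketch, not the authors' implementation: the function name `student_objective` is hypothetical, the three loss values are assumed to be already computed scalars, and the weights used in the usage line are illustrative only, since this summary does not state the values of $\lambda_{\textrm{con}}$ and $\lambda_{\textrm{out}}$.

```python
def student_objective(l_ce, l_con, l_out, lambda_con, lambda_out):
    """Weighted sum that the student F_S minimizes.

    l_ce:  cross-entropy on labeled source images
    l_con: consistency loss on unlabeled target images
    l_out: adversarial loss in the output space
    lambda_con, lambda_out: trade-off weights (values are assumptions here)
    """
    return l_ce + lambda_con * l_con + lambda_out * l_out


# Usage with made-up scalar losses and made-up weights.
total = student_objective(l_ce=1.0, l_con=0.5, l_out=0.2,
                          lambda_con=2.0, lambda_out=0.5)
```

In the min-max formulation, the discriminator $D_{\textrm{out}}$ is trained in the opposite direction on the adversarial term, while the student minimizes the full sum.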

Abstract

Multi-target unsupervised domain adaptation (UDA) aims to learn a unified model to address the domain shift between multiple target domains. Due to the difficulty of obtaining annotations for dense predictions, it has recently been introduced into cross-domain semantic segmentation. However, most existing solutions require labeled data from the source domain and unlabeled data from multiple target domains concurrently during training. Collectively, we refer to this data as "external". When faced with new unlabeled data from an unseen target domain, these solutions either do not generalize well or require retraining from scratch on all data. To address these challenges, we introduce a new strategy called "multi-target UDA without external data" for semantic segmentation. Specifically, the segmentation model is initially trained on the external data. Then, it is adapted to a new unseen target domain without accessing any external data. This approach is thus more scalable than existing solutions and remains applicable when external data is inaccessible. We demonstrate this strategy using a simple method that incorporates self-distillation and adversarial learning, where knowledge acquired from the external data is preserved during adaptation through "one-way" adversarial learning. Extensive experiments in several synthetic-to-real and real-to-real adaptation settings on four benchmark urban driving datasets show that our method significantly outperforms current state-of-the-art solutions, even in the absence of external data. Our source code is available online (https://github.com/YonghaoXu/UT-KD).
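
The "one-way" adversarial idea can be illustrated with a deliberately tiny toy. The sketch below assumes a scalar student parameter `theta` and a frozen logistic discriminator with fixed parameters `disc_w` and `disc_b`; all names, the scalar setup, and the learning rate are hypothetical and far simpler than a segmentation network. It only shows the update direction: the student is moved to make its output look "source-like" to the discriminator, while the discriminator itself is never updated.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def one_way_adversarial_step(theta, x, disc_w, disc_b, lr=0.1):
    """One gradient step on the student only; the discriminator stays frozen.

    Toy student output: s = theta * x.
    Frozen discriminator: d = sigmoid(disc_w * s + disc_b).
    Student loss: -log(d), so lowering it raises the discriminator's score.
    """
    s = theta * x
    d = sigmoid(disc_w * s + disc_b)
    loss = -math.log(d)
    # d(-log sigmoid(w*s + b)) / d theta = -(1 - d) * w * x
    grad_theta = -(1.0 - d) * disc_w * x
    new_theta = theta - lr * grad_theta
    return new_theta, loss


# Usage: disc_w and disc_b are read but never modified.
theta = 0.0
for _ in range(50):
    theta, loss = one_way_adversarial_step(theta, x=1.0, disc_w=2.0, disc_b=-1.0)
```

Because only `theta` changes, whatever the discriminator learned earlier is left untouched, which is the intuition behind preserving knowledge from the external data during adaptation.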

Paper Structure (23 sections, 15 equations, 12 figures, 13 tables)

Introduction
Related work
Single-target unsupervised domain adaptation
Multi-target unsupervised domain adaptation
Source-free domain adaptation
Domain generalization
Problem formulation
Methodology
Multi-target knowledge distillation
Unseen target knowledge distillation
Multi-target style transfer network
Experiments
Datasets and metrics
Implementation details
Synthetic-to-real adaptation
...and 8 more sections

Figures (12)

Figure 1: Different strategies in cross-domain semantic segmentation. (a) Single-target unsupervised domain adaptation (UDA): the segmentation model cannot generalize well to unseen domains. (b) Multi-target UDA: target domains are still predetermined at training and the model needs to be retrained from scratch on all data when a new unseen target domain is given, or else it will suffer from the same problem. (c) Our new strategy, multi-target UDA without external data: the pre-trained model is quickly adapted to a new unseen target domain without accessing any external data from the original source or target domains.
Figure 2: Illustration of our multi-target knowledge distillation (MT-KD). Given a set of labeled images $X_s$ from the source domain and unlabeled images $\mathcal{X}_t=\{X_{t_n}\}_{n=1}^N$ from multiple target domains, the student network $F_S$ is trained by cross-entropy $\mathcal{L}_{\textsc{ce}}$ on the source domain, consistency loss $\mathcal{L}_{\textrm{con}}$ on the target domains and adversarial loss $\mathcal{L}_{\textrm{out}}$ in the output space. The teacher network $F_T$ is obtained by the exponential moving average (EMA) of $F_S$ parameters. Only one target domain is shown for brevity.
Figure 3: Illustration of our unseen target knowledge distillation (UT-KD). Given a set of unlabeled images $X_u$ from an unseen target domain, UT-KD distills and adapts the knowledge from a pre-trained MT-KD model by self-distillation and one-way adversarial learning. Both student and teacher networks $F'_S, F'_T$ are initialized from the pre-trained model. The same holds for the discriminator $D_{\textrm{out}}$, which remains frozen.
Figure 4: Illustration of our multi-target style transfer network (MT-STN). Given a set of labeled images $X_s$ from the source domain and unlabeled images $\mathcal{X}_t=\{X_{t_n}\}_{n=1}^N$ from multiple target domains, the style transfer network $T$ learns to either reconstruct, guided by the reconstruction loss $\mathcal{L}_{\textrm{rec}}$, or transfer the style of the input image to another domain, guided by the adversarial loss $\mathcal{L}_{\textrm{adv}}$, depending on the style parameters $V$ that are plugged into $T$ as shown in Figure 5. There is one discriminator $D_s, \mathcal{D}_t=\{D_{t_n}\}_{n=1}^N$ and one set of learnable style parameters $V_s, \mathcal{V}_t=\{V_{t_n}\}_{n=1}^N$ for each domain. We use $x_{a \to b}$ to denote the transferred image from domain $a$ to $b$. Learning is unsupervised. Only one target domain is shown for brevity.
Figure 5: Architecture of style transfer network $T$ in our MT-STN. Domain style parameters $V$ are plugged into $T$ as parameters of a series of conditional instance normalization (CIN) layers. Here, input image $x_{t_n}$ from target domain $X_{t_n}$ is transferred to the style $V_s$ of source domain $X_s$, denoted as $x_{t_n \to s} = T(x_{t_n}, V_s)$. More examples are shown in Figure 4.
...and 7 more figures
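
The caption of Figure 2 states that the teacher $F_T$ is an exponential moving average (EMA) of the student $F_S$ parameters. A minimal sketch of such an update is given below, assuming parameters are stored as a plain dictionary of floats; the function name `ema_update` and the decay value are hypothetical, as the decay used in the paper is not given in this summary.

```python
def ema_update(teacher, student, decay=0.999):
    """In-place EMA: teacher <- decay * teacher + (1 - decay) * student.

    teacher, student: dicts mapping parameter names to float values.
    decay: smoothing factor in [0, 1); the default is an illustrative assumption.
    """
    for name, s_val in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * s_val
    return teacher


# Usage: the teacher drifts slowly toward the student after each training step.
teacher_params = {"w": 1.0, "b": 0.0}
student_params = {"w": 0.0, "b": 1.0}
ema_update(teacher_params, student_params, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, so its predictions are smoother targets for the consistency loss than the student's own.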
