Transferring to Real-World Layouts: A Depth-aware Framework for Scene Adaptation

Mu Chen, Zhedong Zheng, Yi Yang

TL;DR

The paper addresses semantic segmentation under unsupervised domain adaptation by tackling unrealistic cross-domain layouts produced by naive mixing. It introduces a depth-guided Contextual Filter (DCF) to align mixed samples with real-world depth distributions and a cross-task encoder with Adaptive Feature Optimization (AFO) to fuse segmentation and depth information end-to-end. Using pseudo depth when ground-truth depth is unavailable, the method achieves competitive results on GTA→Cityscapes ($77.7$ mIoU) and SYNTHIA→Cityscapes ($69.3$ mIoU), with ablative evidence showing the benefits of DCF and AFO for both small and large objects. The approach improves context learning and data augmentation realism, enabling stronger cross-domain scene adaptation across transformer- and CNN-based architectures. Overall, depth-aware augmentation and multi-task fusion offer a practical path to more robust real-world semantic segmentation in diverse environments.

Abstract

Scene segmentation via unsupervised domain adaptation (UDA) enables the transfer of knowledge acquired from synthetic source data to real-world target data, which largely reduces the need for manual pixel-level annotations in the target domain. To facilitate domain-invariant feature learning, existing methods typically mix data from the source and target domains by simply copying and pasting pixels. Such vanilla methods are usually sub-optimal since they do not take into account how well the mixed layouts correspond to real-world scenarios, which have an inherent layout. We observe that semantic categories, such as sidewalks, buildings, and sky, display relatively consistent depth distributions and can be clearly distinguished in a depth map. Based on this observation, we propose a depth-aware framework that explicitly leverages depth estimation to mix the categories and facilitates the two complementary tasks, i.e., segmentation and depth learning, in an end-to-end manner. In particular, the framework contains a Depth-guided Contextual Filter (DCF) for data augmentation and a cross-task encoder for contextual learning. DCF simulates real-world layouts, while the cross-task encoder adaptively fuses the complementary features of the two tasks. Several public datasets do not provide depth annotations, so we leverage an off-the-shelf depth estimation network to generate pseudo depth. Extensive experiments show that our proposed methods, even with pseudo depth, achieve competitive performance on two widely used benchmarks, i.e., 77.7 mIoU on GTA to Cityscapes and 69.3 mIoU on SYNTHIA to Cityscapes.
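The "copy and paste" mixing that the abstract calls the vanilla approach can be illustrated with a minimal class-level mixing sketch. This is not the authors' code: the function name `class_mix`, the choice of pasting half of the source classes, and the array shapes are assumptions made for illustration only.

```python
import numpy as np

def class_mix(x_src, y_src, x_tgt, y_tgt_pseudo, rng):
    """Naive cross-domain mixing (illustrative sketch, not the paper's code).

    x_src, x_tgt: (H, W, 3) images; y_src, y_tgt_pseudo: (H, W) integer labels.
    Pixels of a random half of the source classes are pasted onto the target.
    """
    classes = np.unique(y_src)
    n_pick = max(1, len(classes) // 2)          # assumption: paste half the classes
    picked = rng.choice(classes, size=n_pick, replace=False)
    mask = np.isin(y_src, picked)               # True where source pixels are pasted
    x_mix = np.where(mask[..., None], x_src, x_tgt)
    y_mix = np.where(mask, y_src, y_tgt_pseudo)
    return x_mix, y_mix, mask
```

Nothing in this procedure checks whether a pasted class appears at a plausible distance from the camera, which is the gap the depth-aware framework targets.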

Paper Structure

This paper contains 15 sections, 10 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: (a) In driving scenarios, we observe that object locations are relatively stable with respect to their distance from the camera. With this insight, we propose a Depth-guided Contextual Filter (DCF) that is aware of the distribution of semantic categories across the Near, Middle, and Far views to facilitate cross-domain mixing. (b) Since we explicitly take the semantic layout into consideration, our method produces more realistic mixed samples than existing state-of-the-art methods (Vanilla Mixed Sample) [chen2023pipa, hoyer2022daformer]. As shown in the red box, the vanilla mixed sample pastes "new" buildings in front of the parked cars.
  • Figure 2: Source-domain images $x^S$ and target-domain images $x^T$ are mixed together using the ground-truth label $y^S$. The mixed images are de-noised by our proposed Depth-guided Contextual Filter (DCF) and then used to train the network. We illustrate DCF with a practical sample. As illustrated, unrealistic "Building" pixels from the source image are pasted onto the target image, leading to a noisy mixed sample. DCF removes these pixels and retains the mixed pixels of "Traffic Sign" and "Pole" shown in the white dotted boxes, enhancing the realism of cross-domain mixing. (Best viewed when zooming in.)
  • Figure 3: The proposed multi-task learning framework. The input images $x^F$ are mixed from the source image $x^S$ and the target image $x^T$ according to depth (please refer to Figure 2). We then feed $x^S$ and $x^F$ into the high-resolution encoder to generate high-resolution predictions. To enhance multimodal learning, the visual and depth features created by the cross-task encoder are fused and fed into the proposed Adaptive Feature Optimization (AFO) module for multimodal communication. Finally, the multimodal communication via several transformer blocks incorporates and optimizes the fusion of depth information, improving the final visual predictions.
  • Figure 4: Qualitative results. From left to right: Target Image, Ground Truth, and the predictions of HRDA, MIC, and Ours. Prediction differences are highlighted in white dashed boxes; the proposed method predicts clearer edges.
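The filtering idea described in Figures 1 and 2 can be sketched roughly as follows. The captions state only that DCF is aware of the category distribution over Near, Middle, and Far views and removes unrealistic pasted pixels. The density definition, the bin edges `edges`, the tolerance `tol`, and all function names below are assumptions made for illustration, and the paper's exact formulation may differ.

```python
import numpy as np

def depth_bins(depth, edges):
    """Map each pixel to a depth bin: 0 = near, 1 = middle, 2 = far (for two edges)."""
    return np.digitize(depth, edges)

def class_density(labels, bins, num_classes, num_bins):
    """density[b, c] = share of pixels in depth bin b labelled c.

    Assumes integer labels in [0, num_classes); ignore labels must be remapped first.
    """
    density = np.zeros((num_bins, num_classes))
    for b in range(num_bins):
        in_bin = bins == b
        total = in_bin.sum()
        if total > 0:
            density[b] = np.bincount(labels[in_bin], minlength=num_classes) / total
    return density

def depth_guided_filter(mask, y_src, y_tgt, d_src, d_tgt, edges, num_classes, tol):
    """Return a reduced paste mask (hypothetical sketch of a depth-guided filter).

    A pasted pixel is dropped when its (depth bin, class) pair is over-represented
    in the naive mix relative to the target image by more than `tol`.
    """
    num_bins = len(edges) + 1
    y_mix = np.where(mask, y_src, y_tgt)        # labels of the naive mix
    d_mix = np.where(mask, d_src, d_tgt)        # depth of the naive mix
    bins_mix = depth_bins(d_mix, edges)
    ref = class_density(y_tgt, depth_bins(d_tgt, edges), num_classes, num_bins)
    mix = class_density(y_mix, bins_mix, num_classes, num_bins)
    excess = (mix - ref) > tol                  # (num_bins, num_classes) booleans
    drop = mask & excess[bins_mix, y_mix]       # per-pixel lookup of its (bin, class)
    return mask & ~drop
```

In this sketch, a class pasted into a depth range where the target image rarely shows it (such as "Building" in the near view) would be removed, while pasted classes whose density stays within `tol` of the target are kept. The target labels would be pseudo labels in practice, and the depth maps may be pseudo depth from an off-the-shelf estimator, as the abstract notes.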