
Consistency Regularisation for Unsupervised Domain Adaptation in Monocular Depth Estimation

Amir El-Ghoussani, Julia Hornauer, Gustavo Carneiro, Vasileios Belagiannis

TL;DR

This work tackles unsupervised domain adaptation for monocular depth estimation by reframing it as a consistency-based semi-supervised problem that only requires source-domain labels. It introduces a single-model approach combining a pairwise source loss with multi-view perturbation consistency on unlabelled target data, inspired by FixMatch but adapted for continuous depth regression. The method employs RandAugment-based target perturbations and a CutMix-based pretraining stage, resulting in a total loss that balances supervised and unsupervised objectives via a batch-ratio parameter. Across outdoor (virtual KITTI to KITTI) and indoor (SceneNet to NYUv2) benchmarks, the approach achieves state-of-the-art domain-adaptation performance and is validated through thorough ablations. The proposed framework offers a simple, effective alternative to complex multi-model pipelines, with practical impact for deploying depth estimation systems across diverse environments.

Abstract

In monocular depth estimation, unsupervised domain adaptation has recently been explored to relax the dependence on large annotated image-based depth datasets. However, this comes at the cost of training multiple models or requiring complex training protocols. We formulate unsupervised domain adaptation for monocular depth estimation as a consistency-based semi-supervised learning problem by assuming access only to the source domain ground truth labels. To this end, we introduce a pairwise loss function that regularises predictions on the source domain while enforcing perturbation consistency across multiple augmented views of the unlabelled target samples. Importantly, our approach is simple and effective, requiring the training of only a single model, in contrast to prior work. In our experiments, we rely on the standard depth estimation benchmarks KITTI and NYUv2 to demonstrate state-of-the-art results compared to related approaches. Furthermore, we analyse the simplicity and effectiveness of our approach in a series of ablation studies. The code is available at https://github.com/AmirMaEl/SemiSupMDE.
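The two objectives named in the abstract can be illustrated with a minimal NumPy sketch. It follows the Figure 2 caption in comparing the sum of two source predictions with the sum of their ground truths, and in passing five samples through the model at once. The L1 distance, the scalar `weight`, the function names, and `toy_model` are assumptions made for illustration; they are not taken from the paper or its repository, and `weight` merely stands in for whatever balancing the batch-ratio parameter provides.

```python
import numpy as np

def pairwise_source_loss(pred_s1, pred_s2, y_s1, y_s2):
    # Compare the SUM of two source predictions with the SUM of their labels.
    # Using an L1 distance here is an assumption.
    return float(np.mean(np.abs((pred_s1 + pred_s2) - (y_s1 + y_s2))))

def consistency_loss(pred_u1, pred_u2):
    # Penalise disagreement between predictions for two perturbed views
    # of the same unlabelled target image (L1 is again an assumption).
    return float(np.mean(np.abs(pred_u1 - pred_u2)))

def total_loss(model, x_s1, x_s2, y_s1, y_s2, x_u, x_u1, x_u2, weight=1.0):
    # Stack the five samples into one batch and run a single forward pass.
    batch = np.stack([x_s1, x_s2, x_u, x_u1, x_u2])
    # Split the batched prediction back into per-sample depth maps.
    p_s1, p_s2, _p_u, p_u1, p_u2 = model(batch)
    # The role of the unperturbed target prediction (_p_u) is not specified
    # in this excerpt, so this sketch leaves it unused.
    supervised = pairwise_source_loss(p_s1, p_s2, y_s1, y_s2)
    unsupervised = consistency_loss(p_u1, p_u2)
    # `weight` is a hypothetical stand-in for the balancing term.
    return supervised + weight * unsupervised

def toy_model(batch):
    # Hypothetical stand-in for f(.): channel mean as a fake "depth" map.
    # Input shape (N, C, H, W), output shape (N, H, W).
    return batch.mean(axis=1)
```

In a real training loop the model would be a differentiable depth network and the losses would be computed in an autodiff framework; the NumPy version only shows how the terms are assembled.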

Paper Structure (45 sections, 11 equations, 8 figures, 11 tables)

Figures (8)

  • Figure 1: After pretraining on the source domain with CutMix (Yun et al., 2019) data augmentation, the resulting depth predictions show some fidelity but would benefit from refinement in certain areas. Specifically, they appear overly fragmented or "edgy" in localised regions (highlighted in the figure). Based on this observation, we design our consistency-based approach for domain adaptation in monocular depth estimation to smooth these localised, fragmented regions in particular.
  • Figure 2: Overview of the approach. Initially, we sample two independent source domain images $\mathbf{x}_{s,1}$ and $\mathbf{x}_{s,2}$ along with one target domain image $\mathbf{x}_u$. The target domain image is fed into two perturbation streams, each applying independent augmentations, which yields $\Tilde{\mathbf{x}}_{u,1}$ and $\Tilde{\mathbf{x}}_{u,2}$. In total, five samples are concatenated and fed into the depth estimation model $f(\cdot)$. Afterwards, the predictions are chunked back into their initial shapes. The supervised loss on the source domain enforces consistency between the sum of the two source predictions and the sum of the two ground truth samples. The unsupervised loss enforces consistency between the predictions for the two perturbed target views. Yellow and blue colors correspond to the supervised source and the unsupervised target domain, respectively.
  • Figure 3: CutMix augmented input images $\mathbf{x}_{cm,s}$ in the source domain along with their corresponding augmented ground truth depth annotation $\mathbf{y}_{cm,s}$. We choose $\alpha = 0.5$, controlling the patch size of the CutMix augmentation.
  • Figure 4: Qualitative results on KITTI (Geiger et al., 2013) with models trained on vKITTI-KITTI. Ground truth depth is linearly interpolated for visualization.
  • Figure 5: CutMix augmented images with random jitter and rotation augmentations in $D_S$ for images in vKITTI.
  • ...and 3 more figures
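
The CutMix augmentation described in the captions for Figures 3 and 5 applies to both the image and its depth annotation. A minimal NumPy sketch of that idea is given below, assuming the box sampling of the original CutMix formulation (mixing ratio drawn from Beta(α, α), box side scaled by the square root of one minus that ratio). Whether the authors sample the patch in exactly this way is an assumption, and the function name `cutmix_depth` is hypothetical.

```python
import numpy as np

def cutmix_depth(x_a, y_a, x_b, y_b, alpha=0.5, rng=None):
    """Paste one rectangular patch from (x_b, y_b) into (x_a, y_a).

    x_* are images of shape (C, H, W); y_* are depth maps of shape (H, W).
    The same box is used for image and depth so they stay aligned.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, h, w = x_a.shape
    # Mixing ratio as in the original CutMix formulation (assumed here).
    lam = rng.beta(alpha, alpha)
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    # Random box centre; the box is clipped to the image borders.
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    # Copy so the inputs are left untouched.
    x_cm, y_cm = x_a.copy(), y_a.copy()
    x_cm[:, top:bottom, left:right] = x_b[:, top:bottom, left:right]
    y_cm[top:bottom, left:right] = y_b[top:bottom, left:right]
    return x_cm, y_cm
```

Unlike CutMix for classification, where labels are blended by the mixing ratio, a dense target such as depth can simply be cut and pasted with the same box, which is what this sketch does.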