Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors

Weilong Yan, Ming Li, Haipeng Li, Shuwei Shao, Robby T. Tan

TL;DR

This work tackles robust self-supervised monocular depth estimation under adverse conditions such as nighttime, rain, and fog by introducing a synthetic-to-real framework. Training is split into two stages: synthetic adaptation (SA), which transfers daytime motion-structure knowledge through cost volumes, and real adaptation (RA), which uses consistency reweighting and a structure-prior constraint to bridge synthetic and real data. The method reports state-of-the-art results on nuScenes and Robotcar, with average improvements of 7.5% in AbsRel and 4.3% in RMSE, and generalizes better than previous methods in zero-shot evaluation on DrivingStereo. It regularizes real-world predictions using explicit depth distributions and differentiable histograms.
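To make the idea of a histogram-based depth-distribution constraint concrete, here is a minimal sketch of a soft (smooth) histogram. It is an illustration under stated assumptions, not the paper's implementation: the Gaussian kernel, the `bandwidth` parameter, the bin layout, and the L1 distance between histograms are all hypothetical choices, and the function names are made up for this example. It is written in plain Python for readability; in training, the same arithmetic would be expressed in an autodiff framework so gradients can flow back to the predicted depths.

```python
import math


def soft_histogram(depths, bin_centers, bandwidth):
    """Soft histogram of depth values (hypothetical sketch).

    Each depth spreads its unit mass over all bins with Gaussian weights,
    so the result varies smoothly with the inputs, unlike hard binning.
    """
    hist = [0.0] * len(bin_centers)
    for d in depths:
        # Log-weights for every bin; subtracting the max avoids underflow.
        logits = [-0.5 * ((d - c) / bandwidth) ** 2 for c in bin_centers]
        top = max(logits)
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        for i, w in enumerate(weights):
            hist[i] += w / total
    # Normalize by the number of samples so the histogram sums to 1.
    return [h / len(depths) for h in hist]


def l1_histogram_distance(p, q):
    """L1 distance between two histograms (an assumed choice of penalty)."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Usage: compare a prior distribution with the distribution of new predictions.
centers = [5.0, 15.0, 25.0, 35.0]
prior = soft_histogram([4.0, 6.0, 14.0, 30.0], centers, bandwidth=3.0)
adverse = soft_histogram([20.0, 22.0, 33.0, 36.0], centers, bandwidth=3.0)
penalty = l1_histogram_distance(prior, adverse)
```

A larger `penalty` means the two depth distributions disagree more, which is the kind of signal a distribution-level regularizer could penalize.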

Abstract

Self-supervised depth estimation from monocular cameras in diverse outdoor conditions, such as daytime, rain, and nighttime, is challenging due to the difficulty of learning universal representations and the severe lack of labeled real-world adverse data. Previous methods either rely on synthetic inputs and pseudo-depth labels or directly apply daytime strategies to adverse conditions, resulting in suboptimal results. In this paper, we present the first synthetic-to-real robust depth estimation framework, incorporating motion and structure priors to capture real-world knowledge effectively. In the synthetic adaptation, we transfer motion-structure knowledge inside cost volumes for a more robust representation, using a frozen daytime model to train a depth estimator in synthetic adverse conditions. In the real adaptation, which aims to close the synthetic-real gap, the previously trained models identify weather-insensitive regions with a consistency-reweighting strategy designed to emphasize valid pseudo-labels. We introduce a new regularization that gathers explicit depth distributions to constrain the model when facing real-world data. Experiments show that our method outperforms the state of the art across diverse conditions in multi-frame and single-frame evaluations. We achieve average improvements of 7.5% in AbsRel and 4.3% in RMSE on the nuScenes and Robotcar datasets (daytime, nighttime, rain). In zero-shot evaluation on DrivingStereo (rain, fog), our method generalizes better than previous methods.
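For readers unfamiliar with the two metrics quoted above, the following sketch computes them in their commonly used form: AbsRel is the mean of |pred - gt| / gt and RMSE is the root of the mean squared error, both taken over pixels with valid ground truth. The function names, the flat-list inputs, and the `gt > 0` validity rule are assumptions for illustration; the paper's evaluation protocol (depth caps, scaling, cropping) is not reproduced here.

```python
import math


def abs_rel(pred, gt):
    """Absolute relative error, averaged over pixels with valid ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]  # assumed validity rule
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def rmse(pred, gt):
    """Root mean squared error over pixels with valid ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]  # assumed validity rule
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


# Usage with toy depths in metres; the third pixel has no ground truth.
predicted = [2.0, 4.0, 7.0]
ground_truth = [1.0, 5.0, 0.0]
print(abs_rel(predicted, ground_truth))
print(rmse(predicted, ground_truth))
```

Lower is better for both metrics. AbsRel weights errors relative to distance, while RMSE is dominated by large absolute errors, which typically occur at far range.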

Paper Structure

This paper contains 17 sections, 13 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Visual comparison between the previous method md4all [gasperini_morbitzer2023md4all] and ours. The first and second rows correspond to nighttime and rain, where ours provides more robust estimates, especially for vehicles.
  • Figure 2: Illustration of our proposed pipeline. (a) Synthetic adaptation utilizes an augmentation model to generate paired data for training, and we also conduct learning in an auxiliary motion space. (b) Real adaptation leverages the daytime model and the synthetic model to provide valid pseudo-labels with consistency reweighting; the depth distribution gathered from many daytime predictions serves as a structure prior that constrains the model on real adverse data. (c) The inference stage works in both multi-frame and single-frame settings.
  • Figure 3: Analysis of our design. (a) Illustration of the performance gap between synthetic and real data. (b) Green boxes highlight patterns that differ between synthetic and real data, which affect the model's estimation, especially for far planes. (c)-(d) The difference between the distribution of depth predicted in daytime and that of depth estimated in adverse conditions.
  • Figure 4: Qualitative comparison of depth predictions under adverse conditions. The first two rows show nighttime and rain conditions in the nuScenes [nuscenes2019] dataset, and the last two rows show zero-shot evaluation results on the DrivingStereo [drivingstereo] dataset. Our method handles the challenging cases indicated by green boxes (verifiable against the ground truth), where other methods fail.
  • Figure 5: Visualization of consistency maps.
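
The consistency reweighting referred to in the Figure 2 and Figure 5 captions can be illustrated with a small sketch. This is a hypothetical formulation, not the paper's: it assumes that per-pixel agreement between two depth predictions for the same image (for example, from the daytime model and the synthetic model) is turned into a weight through an exponential of the relative difference, and that the weight then scales an L1 pseudo-label loss. The names `consistency_weights`, `reweighted_l1`, and `temperature` are made up for this example, as is the choice of which prediction acts as the pseudo-label.

```python
import math


def consistency_weights(depth_a, depth_b, temperature=0.1):
    """Per-pixel weights in (0, 1]: close to 1 where two predictions agree.

    The exponential form and the temperature are assumptions for illustration.
    """
    weights = []
    for a, b in zip(depth_a, depth_b):
        # Relative disagreement, normalized by the mean of the two depths.
        rel_diff = abs(a - b) / max((a + b) / 2.0, 1e-6)
        weights.append(math.exp(-rel_diff / temperature))
    return weights


def reweighted_l1(pred, pseudo, weights):
    """Weighted L1 loss: pixels with low consistency contribute little."""
    num = sum(w * abs(p - q) for p, q, w in zip(pred, pseudo, weights))
    return num / max(sum(weights), 1e-6)


# Usage: the two models agree on pixel 0 and disagree strongly on pixel 1.
day_model = [10.0, 10.0]
syn_model = [10.0, 20.0]
w = consistency_weights(day_model, syn_model)
student = [10.0, 0.0]
loss = reweighted_l1(student, syn_model, w)  # syn_model used as pseudo-label here
```

In this toy case the second pixel receives a weight near zero, so the large error there barely affects the loss. A map of such weights over an image is one way to picture a consistency map.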