Uncertainty and Self-Supervision in Single-View Depth

Javier Rodriguez-Puigvert

TL;DR

This work tackles single-view depth estimation by integrating uncertainty quantification and self-supervision to address data scarcity and ill-posedness. It first establishes a Bayesian framework for supervised single-view depth, comparing MC dropout and deep ensembles and showing that encoder-focused dropout yields strong depth estimates while ensembles yield better uncertainty calibration. It then extends to the medical domain with colonoscopy imagery, introducing an uncertainty-aware teacher-student mechanism that improves generalization from synthetic to real data. Finally, it introduces LightDepth, a purely self-supervised method that exploits the decline of illumination with distance from a co-located light source to learn depth, albedo, and normals from a single image, with test-time refinement achieving near-supervised performance on phantom data and robust results on real endoscopy. Collectively, the work advances practical, uncertainty-aware depth estimation in challenging domains such as robotics and medical imaging, without heavy reliance on labeled depth data.
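As a rough illustration of how MC dropout or a deep ensemble can be turned into a depth map with uncertainty, the sketch below combines $M$ forward passes into a mean depth plus epistemic and aleatoric terms. It assumes each pass outputs a per-pixel mean and variance and uses the common Gaussian-mixture decomposition; the function name `decompose_uncertainty` and the array layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def decompose_uncertainty(means, variances):
    """Combine M stochastic forward passes into depth and uncertainty maps.

    means, variances: array-likes of shape (M, H, W), assumed to hold the
    per-pixel predicted depth mean and predicted variance of each pass
    (an MC dropout sample or an ensemble member).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    depth = means.mean(axis=0)          # final depth: average over passes
    epistemic = means.var(axis=0)       # disagreement between passes
    aleatoric = variances.mean(axis=0)  # average predicted data noise
    total = epistemic + aleatoric       # total predictive variance
    return depth, epistemic, aleatoric, total

# Toy example with M=2 passes on a 1x2 image.
means = [[[1.0, 2.0]], [[3.0, 2.0]]]
variances = [[[0.5, 0.5]], [[0.5, 0.5]]]
depth, epistemic, aleatoric, total = decompose_uncertainty(means, variances)
```

In the toy example the two passes disagree on the first pixel and agree on the second, so only the first pixel receives a non-zero epistemic term.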

Abstract

Single-view depth estimation is the task of deriving per-pixel three-dimensional information from a single two-dimensional image. It is an ill-posed problem because multiple depth solutions can explain the same single view. While deep neural networks have been shown to be effective at capturing depth from a single view, the majority of current methodologies are deterministic. Accounting for uncertainty in the predictions can avoid disastrous consequences in fields such as autonomous driving or medical robotics. We address this problem by quantifying the uncertainty of supervised single-view depth with Bayesian deep neural networks. Supervised training requires annotated depth, and there are scenarios, especially in medicine with endoscopic images, where such annotated data is not available. To alleviate the lack of data, we present a method that improves the transfer from the synthetic to the real domain. We introduce an uncertainty-aware teacher-student architecture that is trained in a self-supervised manner, taking the teacher uncertainty into account. Given the vast amount of unannotated data and the challenges of capturing annotated depth in minimally invasive medical procedures, we advocate a fully self-supervised approach that only requires RGB images and the geometric and photometric calibration of the endoscope. In endoscopic imaging, the camera and light sources are co-located at a small distance from the target surfaces. In this setup, brighter areas of the image are nearer to the camera, while darker areas are farther away. Building on this observation, we exploit the fact that, for any given albedo and surface orientation, pixel brightness is inversely proportional to the square of the distance. We propose the use of illumination as a strong single-view self-supervisory signal for deep neural networks.
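The inverse-square relation stated above can be checked with a few lines of code. This is a minimal sketch, assuming a Lambertian surface lit by a point light co-located with the camera and ignoring camera gain, gamma, and the light's angular spread; `render_brightness` and `light_power` are hypothetical names, not the paper's photometric model.

```python
import numpy as np

def render_brightness(albedo, cos_theta, distance, light_power=1.0):
    """Brightness of a surface point under a co-located point light.

    Assumed model: brightness = light_power * albedo * cos(theta) / distance**2,
    where theta is the angle between the surface normal and the light direction.
    """
    albedo = np.asarray(albedo, dtype=float)
    # Surfaces facing away from the light receive no illumination.
    cos_theta = np.clip(np.asarray(cos_theta, dtype=float), 0.0, None)
    distance = np.asarray(distance, dtype=float)
    return light_power * albedo * cos_theta / distance**2

# Same albedo and orientation, observed at two distances.
near = render_brightness(albedo=0.8, cos_theta=1.0, distance=2.0)
far = render_brightness(albedo=0.8, cos_theta=1.0, distance=4.0)
ratio = float(near / far)  # doubling the distance divides brightness by four
```

With albedo and orientation held fixed, the brightness ratio depends only on the distances, which is what makes illumination usable as a depth cue in this setting.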

Paper Structure

This paper contains 69 sections, 24 equations, 35 figures, 12 tables.

Figures (35)

  • Figure 1: ICCV 2023 Poster Session
  • Figure 2: Bayesian single-view depth prediction for a SceneNet image. The middle row shows the small depth error and how accurately the total uncertainty models it. The bottom row shows that epistemic and aleatoric sources are both significant and relevant for uncertainty quantification.
  • Figure 3: Variations of MC dropout in our experiments.
  • Figure 4: Comparison of MC dropout variations and deep ensembles for different numbers of forward passes $M$. Left: RMSE. Right: AUSE. Performance improves as $M$ increases, but only slightly for $M>18$.
  • Figure 5: Calibration curves (AUCE and AUSE) for MC dropout and deep ensembles.
  • ...and 30 more figures