LightDepth: Single-View Depth Self-Supervision from Illumination Decline

Javier Rodríguez-Puigvert, Víctor M. Batlle, J. M. M. Montiel, Ruben Martinez-Cantin, Pascal Fua, Juan D. Tardós, Javier Civera

TL;DR

In some medical devices, such as endoscopes, the camera and light sources are co-located at a small distance from the target surfaces, and pixel brightness is inversely proportional to the square of the distance to the surface, providing a strong single-view self-supervisory signal.

Abstract

Single-view depth estimation can be remarkably effective if there is enough ground-truth depth data for supervised training. However, there are scenarios, especially in medicine, as with endoscopies, where such data cannot be obtained. In such cases, multi-view self-supervision and synthetic-to-real transfer serve as alternative approaches, but with a considerable performance reduction compared to the supervised case. Instead, we propose a single-view self-supervised method that achieves performance similar to the supervised case. In some medical devices, such as endoscopes, the camera and light sources are co-located at a small distance from the target surfaces. Thus, we can exploit the fact that, for any given albedo and surface orientation, pixel brightness is inversely proportional to the square of the distance to the surface, providing a strong single-view self-supervisory signal. In our experiments, our self-supervised models deliver accuracies comparable to those of fully supervised ones, while being applicable without depth ground-truth data.
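The inverse-square relation stated in the abstract can be illustrated with a minimal sketch. The function names, the `gain` factor, and the use of a single cosine term for surface orientation are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
import numpy as np

def brightness(depth, albedo=1.0, cos_theta=1.0, gain=1.0):
    """Brightness under a co-located light: gain * albedo * cos(theta) / depth**2.

    All parameter names are hypothetical; the point is the 1/depth**2 term.
    """
    depth = np.asarray(depth, dtype=float)
    return gain * albedo * cos_theta / depth**2

def depth_from_brightness(pixel, albedo=1.0, cos_theta=1.0, gain=1.0):
    """Invert the relation above for a known albedo and surface orientation."""
    pixel = np.asarray(pixel, dtype=float)
    return np.sqrt(gain * albedo * cos_theta / pixel)

depths = np.array([1.0, 2.0, 4.0])
values = brightness(depths)
print(values)                         # each doubling of distance quarters the brightness
print(depth_from_brightness(values))  # recovers the original depths
```

With albedo and orientation fixed, brightness alone determines depth, which is why the relation can act as a supervisory signal from a single view.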

Paper Structure (16 sections, 8 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Single-view depth self-supervision in LightDepth. A two-headed deep network predicts albedo and depth from a single image and estimates surface normals from the predicted depths. These are used to render a new image that takes into account illumination decline and the endoscope's photometric calibration and that can be compared to the original one. Minimizing the difference between the original and rendered images is used at training time to learn the network weights and at inference time to refine the depth predictions.
  • Figure 2: Spotlight illumination model. A spotlight source at position $\mathbf{x}_l$ illuminates the surface point $\mathbf{x}_i$. The emission has a radial fall-off $R(\psi_i)$, suffers an inverse-square decline along $\mathbf{x}_l \rightarrow \mathbf{x}_i$, and attenuates with the incidence angle $\theta_i$. $\mathbf{l}_i$, $\mathbf{n}_i$, $\mathbf{r}_i$ and $\mathbf{s}_i$ are unit vectors.
  • Figure 3: Network Architecture. Left: The input image is fed into a neural network that predicts albedo and depth values for each pixel. From the estimated depths, we compute the surface normal at each pixel using a kernel-based approach. Then, the depths, albedos, and normals are sent to a differentiable renderer that takes into account illumination decline and the endoscope's photometric model, and generates a synthetic image that should be as similar as possible to the original one. We also use specular reflections in saturated pixels to self-supervise normals. We investigated two different architectures. Center: LightDepth U-Net is based on a standard U-Net [Ronneberger2015] with two decoding branches. Right: LightDepth DPT is based on the DPT-Hybrid architecture [Ranftl_2021_ICCV], with a second decoder branch added for the albedo.
  • Figure 4: LightDepth and LightDepth TTR on C3VD. Our light decline model captures the correct shape of the cecum in the first image and the shape of the polyp in the second. Note how the estimates of normals and albedo are similar before and after TTR. By optimising depth by reducing illumination, LightDepth achieves a darker appearance and improvements in depth estimation.
  • Figure 5: Qualitative results on EndoMapper with LightDepth DPT. Columns 1 to 5 are real colonoscopy images, and columns 6 and 7 are real gastroscopy images. In colonoscopies, observe that the normals exhibit a tubular shape characteristic of the colon. The albedo prediction captures disruptions such as veins, blood, dirt, foam and specularities. Note the influence of light decline in the image and the correlation with the estimated depths.
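
The rendering loop described in Figures 1 to 3 (depth to normals, then a render with illumination decline, then a comparison with the input image) can be sketched in NumPy as follows. This is a simplified illustration under several assumptions that do not come from the paper: a pinhole camera with hypothetical intrinsics `f`, `cx`, `cy`; a light at the camera centre; Lambertian shading; a radial fall-off of the assumed form $\cos^k \psi$; finite differences standing in for the kernel-based normal estimation; and an L1 photometric loss. The endoscope's photometric calibration and the specular term are omitted.

```python
import numpy as np

def backproject(depth, f, cx, cy):
    """Lift a depth map (z along the optical axis) to 3D points, pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / f * depth
    y = (v - cy) / f * depth
    return np.stack([x, y, depth], axis=-1)

def normals_from_points(points):
    """Unit normals from the cross product of finite-difference tangents."""
    du = np.gradient(points, axis=1)   # tangent along image columns
    dv = np.gradient(points, axis=0)   # tangent along image rows
    n = np.cross(du, dv)
    n = n / np.linalg.norm(n, axis=-1, keepdims=True)
    # Flip normals that point away from the camera at the origin.
    away = np.sum(n * points, axis=-1, keepdims=True) > 0
    return np.where(away, -n, n)

def render(depth, albedo, f, cx, cy, light_pos=None, falloff_k=0.0, gain=1.0):
    """Shade each pixel as gain * albedo * R(psi) * cos(theta) / distance**2."""
    light_pos = np.zeros(3) if light_pos is None else np.asarray(light_pos, float)
    points = backproject(depth, f, cx, cy)
    normals = normals_from_points(points)
    to_light = light_pos - points
    dist = np.linalg.norm(to_light, axis=-1)
    l = to_light / dist[..., None]
    cos_theta = np.clip(np.sum(l * normals, axis=-1), 0.0, None)
    # Assumed fall-off about the optical axis (+z): R(psi) = cos(psi)**k.
    cos_psi = np.clip(-l[..., 2], 0.0, None)
    radial = cos_psi ** falloff_k
    return gain * albedo * radial * cos_theta / dist**2

def photometric_loss(rendered, observed):
    """Mean absolute difference between rendered and observed images."""
    return float(np.mean(np.abs(rendered - observed)))

# Two fronto-parallel planes: doubling depth should quarter the centre brightness.
near = render(np.full((5, 5), 2.0), albedo=1.0, f=100.0, cx=2.0, cy=2.0)
far = render(np.full((5, 5), 4.0), albedo=1.0, f=100.0, cx=2.0, cy=2.0)
print(near[2, 2] / far[2, 2])
print(photometric_loss(near, far))
```

In a training setup, the depth and albedo maps would come from the network and the loss would be minimized with an automatic-differentiation framework; NumPy is used here only to keep the sketch self-contained.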