Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

Akshay Paruchuri, Samuel Ehrenstein, Shuxian Wang, Inbar Fried, Stephen M. Pizer, Marc Niethammer, Roni Sengupta

TL;DR

Photometric cues, i.e., the light emitted from an endoscope and reflected by the surface, are used to improve monocular depth estimation, and teacher-student transfer learning is introduced to produce better depth maps from both synthetic data with supervision and clinical data with self-supervision.

Abstract

Monocular depth estimation in endoscopy videos can enable assistive and robotic surgery to obtain better coverage of the organ and detection of various health issues. Despite promising progress on mainstream natural-image depth estimation, existing techniques perform poorly on endoscopy images due to a lack of strong geometric features and challenging illumination effects. In this paper, we utilize the photometric cues, i.e., the light emitted from an endoscope and reflected by the surface, to improve monocular depth estimation. We first create two novel loss functions with supervised and self-supervised variants that utilize a per-pixel shading representation. We then propose a novel depth refinement network (PPSNet) that leverages the same per-pixel shading representation. Finally, we introduce teacher-student transfer learning to produce better depth maps from both synthetic data with supervision and clinical data with self-supervision. We achieve state-of-the-art results on the C3VD dataset while estimating high-quality depth maps from clinical data. Our code, pre-trained models, and supplementary materials can be found on our project page: https://ppsnet.github.io/

Paper Structure (31 sections, 14 equations, 9 figures, 6 tables, 2 algorithms)


Figures (9)

  • Figure 1: Our approach models near-field lighting, emitted by the endoscope and reflected by the surface, as Per-Pixel Shading (PPS). We use PPS features to perform depth refinement (PPSNet) on clinical data using teacher-student transfer learning and PPS-informed self-supervision. Our method outperforms the state-of-the-art monocular depth estimation technique LightDepth (Rodríguez-Puigvert et al., 2023), which trains using self-supervision that explicitly reconstructs input images using illumination decline.
  • Figure 2: An overview of our proposed approach to train a student network capable of producing high-quality depth maps on both synthetic colonoscopy data and real clinical colonoscopy data. For simplicity, PPSNet here corresponds to lines 4-7 of the paper's main algorithm.
  • Figure 3: An example of how we compute our PPS representation using depths and surface normals. The computed albedo-modulated PPS representation is strongly correlated to the corresponding input image.
  • Figure 4: Qualitative evaluation on the C3VD dataset. Red indicates regions farther from the camera and blue indicates closer regions. Regions to note during visual comparison are outlined in white. Ours-Teacher performs best, improving upon Ours-Backbone due to our proposed depth refinement PPSNet and self-supervised PPS loss.
  • Figure 5: Qualitative evaluation on clinical data. Red indicates regions farther from the camera and blue indicates closer regions. Additional results can be found in the appendices.
  • ...and 4 more figures
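
The Figure 3 caption says the PPS representation is computed from depths and surface normals. As a rough illustration only, the sketch below shades a single surface point under a commonly used near-field point-light model: inverse-square attenuation multiplied by the cosine between the light direction and the surface normal. The function names, the assumption that the light sits at the camera origin, the clamping of back-facing points, and the pinhole back-projection are all illustrative choices, not details taken from the paper, whose exact formulation may differ.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth.

    fx, fy, cx, cy are hypothetical camera intrinsics.
    """
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def per_pixel_shading(point, normal, light_pos=(0.0, 0.0, 0.0)):
    """Shading at one surface point under a near-field point light.

    point     -- 3D surface position in camera coordinates
    normal    -- surface normal at that point (need not be unit length)
    light_pos -- light position; ASSUMED co-located with the camera
    """
    to_light = [l - p for l, p in zip(light_pos, point)]
    dist = math.sqrt(sum(c * c for c in to_light))
    if dist == 0.0:
        raise ValueError("surface point coincides with the light source")
    light_dir = [c / dist for c in to_light]

    n_len = math.sqrt(sum(c * c for c in normal))
    if n_len == 0.0:
        raise ValueError("normal must be non-zero")
    unit_normal = [c / n_len for c in normal]

    attenuation = 1.0 / (dist * dist)  # inverse-square falloff (assumed)
    cosine = sum(l * n for l, n in zip(light_dir, unit_normal))
    return attenuation * max(cosine, 0.0)  # clamp back-facing points to 0


# A surface patch 2 units in front of the camera, facing it directly.
center = backproject(320, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
pps_near = per_pixel_shading(center, (0.0, 0.0, -1.0))
pps_far = per_pixel_shading((0.0, 0.0, 4.0), (0.0, 0.0, -1.0))
```

Under this model, doubling the distance of a camera-facing patch cuts its shading by a factor of four, which shows why image brightness carries a depth cue when the light source travels with the camera.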