Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency

Hao Li, Daiwei Lu, Jesse d'Almeida, Dilara Isik, Ehsan Khodapanah Aghdam, Nick DiSanto, Ayberk Acar, Susheela Sharma, Jie Ying Wu, Robert J. Webster, Ipek Oguz

TL;DR

The paper addresses monocular absolute depth estimation in endoscopy, where metric depth ground truth is difficult to obtain for real frames and a domain gap remains between translated synthetic images and real endoscopic frames. It proposes unsupervised latent-space domain adaptation that jointly trains a depth network on translated synthetic images $I_s'$ and real images $I_r$, using a shared encoder with a domain adversarial loss $L_{adv}$ and a cosine-based latent consistency loss $L_{con}$ alongside the supervised loss $L_{sup}$. Evaluations on central airway obstruction phantoms show superior performance on both absolute and relative depth metrics compared with state-of-the-art methods, and the improvements hold across backbone sizes and pretrained weights. By reducing the domain gap at the feature level, the approach aims to provide more reliable metric depth for localization and 3D reconstruction in autonomous surgical robotics; code is available at https://github.com/MedICL-VU/MDE.

Abstract

Monocular depth estimation (MDE) is a critical task to guide autonomous medical robots. However, obtaining absolute (metric) depth from an endoscopy camera in surgical scenes is difficult, which limits supervised learning of depth on real endoscopic images. Current image-level unsupervised domain adaptation methods translate synthetic images with known depth maps into the style of real endoscopic frames and train depth networks using these translated images with their corresponding depth maps. However, a domain gap often remains between real and translated synthetic images. In this paper, we present a latent feature alignment method to improve absolute depth estimation by reducing this domain gap in the context of endoscopic videos of the central airway. Our method is agnostic to the image translation process and focuses on the depth estimation itself. Specifically, the depth network takes translated synthetic and real endoscopic frames as input and learns latent domain-invariant features via adversarial learning and directional feature consistency. The evaluation is conducted on endoscopic videos of central airway phantoms with manually aligned absolute depth maps. Compared to state-of-the-art MDE methods, our approach achieves superior performance on both absolute and relative depth metrics, and consistently improves results across various backbones and pretrained weights. Our code is available at https://github.com/MedICL-VU/MDE.
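As a rough illustration of the "directional feature consistency" idea, the sketch below computes a cosine-based consistency term between two latent feature vectors and combines it with supervised and adversarial terms. This is not the authors' implementation: the function names, the `1 - cosine similarity` form, the choice of which features are paired, and the weights `w_adv` and `w_con` are all assumptions made for illustration, and plain Python lists stand in for network feature tensors.

```python
import math


def cosine_consistency_loss(feat_a, feat_b, eps=1e-8):
    """Return 1 - cosine similarity of two equal-length feature vectors.

    Assumed form of a cosine-based consistency term: it depends only on
    the direction of the features, not on their magnitude.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    # eps guards against division by zero for all-zero features.
    return 1.0 - dot / max(norm_a * norm_b, eps)


def total_loss(l_sup, l_adv, l_con, w_adv=1.0, w_con=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return l_sup + w_adv * l_adv + w_con * l_con


if __name__ == "__main__":
    # Features pointing in the same direction give zero loss,
    # regardless of their scale.
    print(cosine_consistency_loss([1.0, 0.0], [2.0, 0.0]))
    # Orthogonal features give a loss of one.
    print(cosine_consistency_loss([1.0, 0.0], [0.0, 1.0]))
```

Because the term ignores feature magnitude, it constrains only the direction of the latent features, which is one plausible reading of "directional" in the abstract.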

Paper Structure

This paper contains 3 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: Current and proposed methods for monocular depth estimation. (a) UDA: A synthetic image ($I_s$) is first translated into the style of real endoscopic images to obtain $I_s'$, which is then used to train a depth network. During inference, a real image ($I_r$) is passed directly into the pretrained (fixed) depth network. (b) RG-UDA: As a close variant, $I_r$ is first passed through the generator to align with the generative training distribution before depth estimation. (c) DFA: With the pretrained depth network from (a), a separate encoder is trained for $I_r$ using a discriminator and adversarial loss ($L_{\text{adv}}$) to reduce the domain gap. At inference, this encoder and the pretrained depth decoder are used to predict depth from $I_r$. (d) Ours: Features from $I_s'$ and $I_r$ are aligned in the latent space during training using supervised ($L_{\text{sup}}$), adversarial ($L_{\text{adv}}$), and consistency ($L_{\text{con}}$) losses. Unlike (a–c), which only partially leverage domain information, our approach updates the entire depth network using both domains during training to provide better adaptation to $I_r$.
  • Figure 2: (a–b) Illustrations of masks. (c–d) RMSE comparison between different pretrained weights and backbone sizes. "DA" and "Endo" denote the pretrained weights from Depth Anything v2 (Metric) (Yang et al., 2024) and EndoOmni (Tian et al., 2024), respectively.
  • Figure 3: Qualitative results with different size backbones (ViTs, ViTb, and ViTl). The top row shows the AbsRel error maps whereas the bottom row shows the predicted depth maps. EndoOmni pretrained weights were used for each model. Larger backbones capture boundary depth information more accurately (yellow arrows). This is consistent with Fig. 2(c–d), where larger backbones have lower RMSE. However, larger errors (red arrows) are still observed in homogeneous regions.
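
The captions above refer to RMSE, AbsRel, and evaluation masks. For readers unfamiliar with these metrics, the sketch below computes both under their standard definitions on flattened depth maps. The function name, the optional boolean `mask`, and the rule of skipping pixels with non-positive ground truth are assumptions for illustration; the paper's exact masking and units are not given in this text.

```python
import math


def masked_depth_metrics(pred, target, mask=None):
    """Return (AbsRel, RMSE) over the valid pixels of flattened depth maps.

    AbsRel = mean(|pred - target| / target)
    RMSE   = sqrt(mean((pred - target) ** 2))
    Pixels are skipped when the mask is False or the ground truth is
    not positive (an assumed validity rule).
    """
    if mask is None:
        mask = [True] * len(target)
    pairs = [(p, t) for p, t, m in zip(pred, target, mask) if m and t > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    abs_rel = sum(abs(p - t) / t for p, t in pairs) / len(pairs)
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))
    return abs_rel, rmse


if __name__ == "__main__":
    pred = [1.0, 2.0, 4.0]
    target = [1.0, 4.0, 2.0]
    mask = [True, True, False]  # the third pixel is excluded
    print(masked_depth_metrics(pred, target, mask))
```

AbsRel is scale-normalized per pixel, while RMSE is reported in the same units as the depth maps, so RMSE is the one that reflects absolute (metric) error directly.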