Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Generative Latent Priors

Ziang Xu, Bin Li, Yang Hu, Chenyu Zhang, James East, Sharib Ali, Jens Rittscher

TL;DR

This work tackles monocular depth and pose estimation in endoscopy, where ground-truth supervision is difficult to obtain and scene conditions are highly challenging. It introduces a dual framework: a DepthNet conditioned on a Generative Latent Bank pretrained on depth maps, and a VAE-constrained PoseNet that treats pose transitions as latent variables regularized by a KL divergence to a standard Gaussian. The model is trained with self-supervised reprojection losses and outperforms published self-supervised methods on the SimCol and EndoSLAM datasets, with ablations supporting the contribution of both the latent priors and the VAE regularization. The approach is aimed at more reliable 3D lesion mapping in the GI tract and robust endoscopic 3D reconstruction; clinical validation and 3D colon reconstruction are planned as future work. Key quantities include depth maps $d_i$, pose differences $z_{pos}$, the reprojection loss $L_{reproj}$, and the KL regularization $D_{KL}(q(z_{pos}) \,\|\, \mathcal{N}(0,I))$.

Abstract

Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, requiring reliable depth and pose estimation. However, endoscopy systems are monocular, and existing methods relying on synthetic datasets or complex models often lack generalizability in challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a Generative Latent Bank and a Variational Autoencoder (VAE). The Generative Latent Bank leverages extensive depth scenes from natural images to condition the depth network, enhancing realism and robustness of depth predictions through latent feature priors. For pose estimation, we reformulate it within a VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract's complex textures and lighting. Extensive evaluations on SimCol and EndoSLAM datasets confirm our framework's superior performance over published self-supervised methods in endoscopic depth and pose estimation.
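
As a rough illustration of what treating pose transitions as latent variables with a KL penalty can look like, here is a minimal NumPy sketch. It assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance, and a 6-dimensional latent; neither detail is stated in the abstract. The function names `kl_to_standard_normal` and `sample_pose_latent` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)


def sample_pose_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


# Hypothetical encoder outputs for one frame pair (assumed 6-D latent).
rng = np.random.default_rng(0)
mu = np.array([0.1, -0.2, 0.5, 0.0, 0.0, 0.01])
log_var = np.full(6, -2.0)

z_pos = sample_pose_latent(mu, log_var, rng)  # latent pose transition sample
kl = kl_to_standard_normal(mu, log_var)       # regularization term added to the loss
```

The KL term is zero only when the posterior equals the standard Gaussian, so it pulls the latent pose distribution toward a fixed scale. How this term is weighted against the reprojection loss is not specified in the abstract.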

Paper Structure

This paper contains 15 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Workflow for 3D lesion mapping in endoscopy: depth estimation generates scene depth maps from monocular video frames, which, combined with pose trajectory, allow 3D reconstruction of the colon and precise lesion localization for improved diagnostics and surgical planning.
  • Figure 2: Overview of the proposed method. The method consists of a depth estimation network with a pre-trained Generative Latent Bank and a VAE-constrained pose estimation network. The entire method is trained in a self-supervised manner, using reprojection between subsequent frames as the supervision signal.
  • Figure 3: The pre-training process of Generative Latent Bank. A Gaussian latent vector is upsampled through transpose convolutions, with adaptive noise injections at each resolution, to produce variable depth maps. Trained in a GAN framework, the latent bank generates realistic depth maps, while a discriminator classifies real and synthetic maps to refine generation quality.
  • Figure 4: Qualitative results of depth estimation. Depth maps generated by Monodepth2 [godard2019digging], MonoViT [zhao2022monovit], Lite-Mono [zhang2023lite], and our method. Our method demonstrates superior performance, particularly on phantom and real frames, where complex textures and lighting variations make depth and pose estimation difficult.
  • Figure 5: Qualitative results of pose estimation. Our method outperforms others by achieving accurate x, y, and z scale consistency and improved rotation angle alignment.
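
The reprojection supervision mentioned in the Figure 2 caption can be sketched as follows. This is a generic view-synthesis warp under a pinhole camera model, not the authors' implementation: the intrinsics `K`, the pose convention (a 4x4 transform from the target camera to the source camera), nearest-neighbour sampling, and the plain L1 error are all simplifying assumptions, and the function names are hypothetical. Self-supervised pipelines typically use differentiable bilinear sampling and a richer photometric term.

```python
import numpy as np


def reproject_pixels(depth, K, T_src_from_tgt):
    """Map each target pixel into the source image using depth and relative pose.

    depth: (H, W) target-frame depth; K: (3, 3) intrinsics;
    T_src_from_tgt: (4, 4) rigid transform, target camera -> source camera.
    Returns an (H, W, 2) array of (u, v) coordinates in the source image.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)    # back-project to 3D
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])     # homogeneous points
    src = (T_src_from_tgt @ cam_h)[:3]                       # move to source camera
    proj = K @ src                                           # project to pixels
    uv = proj[:2] / proj[2:3]
    return uv.T.reshape(H, W, 2)


def photometric_l1(target, source, uv):
    """Mean L1 error between target and the source sampled at rounded uv.

    Pixels that land outside the source image are ignored.
    """
    H, W = target.shape
    ui = np.rint(uv[..., 0]).astype(int)
    vi = np.rint(uv[..., 1]).astype(int)
    valid = (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
    return float(np.mean(np.abs(target[valid] - source[vi[valid], ui[valid]])))


# Toy setup: constant depth, hypothetical intrinsics, random grayscale frame.
H, W = 48, 64
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
depth = np.full((H, W), 2.0)
frame = np.random.default_rng(0).random((H, W))

# With the identity pose every pixel maps to itself, so the error vanishes.
uv_identity = reproject_pixels(depth, K, np.eye(4))
loss_identity = photometric_l1(frame, frame, uv_identity)

# A small sideways translation shifts all pixels by focal * tx / depth.
T_shift = np.eye(4)
T_shift[0, 3] = 0.02
uv_shift = reproject_pixels(depth, K, T_shift)
```

Because the warp depends on both the predicted depth and the predicted pose, minimizing the photometric error between the target frame and the warped source frame gives a training signal for both networks without ground-truth labels.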