Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Generative Latent Priors

Ziang Xu; Bin Li; Yang Hu; Chenyu Zhang; James East; Sharib Ali; Jens Rittscher

TL;DR

This work tackles monocular depth and pose estimation in endoscopy, where ground-truth supervision is hard to obtain and imaging conditions are challenging. It introduces a dual framework: a DepthNet conditioned on a Generative Latent Bank pretrained on depth maps, and a VAE-constrained PoseNet that treats pose transitions as latent variables regularized by a KL divergence to a standard Gaussian. The model is trained with self-supervised reprojection losses and outperforms published self-supervised methods on the SimCol and EndoSLAM datasets, with ablations supporting the contribution of both the latent priors and the VAE regularization. The approach is intended to enable more reliable 3D lesion mapping in the GI tract and robust endoscopic 3D reconstruction; clinical validation and 3D colon reconstruction are planned as future work. Key quantities include depth maps $d_i$, pose differences $z_{pos}$, the reprojection loss $L_{reproj}$, and the KL regularizer $D_{KL}(q(z_{pos}) \,\|\, \mathcal{N}(0, I))$.

Abstract

Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, but it requires reliable depth and pose estimation. Endoscopy systems are monocular, however, and existing methods that rely on synthetic datasets or complex models often generalize poorly to challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a Generative Latent Bank and a Variational Autoencoder (VAE). The Generative Latent Bank leverages extensive depth scenes from natural images to condition the depth network, enhancing the realism and robustness of depth predictions through latent feature priors. We reformulate pose estimation within a VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract's complex textures and lighting. Extensive evaluations on the SimCol and EndoSLAM datasets confirm our framework's superior performance over published self-supervised methods in endoscopic depth and pose estimation.
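The KL regularization of the pose latent toward a standard Gaussian can be illustrated with a minimal sketch. It assumes a diagonal-Gaussian posterior parameterized by a mean and a log-variance, which is the usual VAE choice but is not stated in the text above. The function names and the 6-dimensional latent are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Assumes a diagonal-Gaussian posterior (an assumption, not from the paper).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


# Hypothetical 6-D pose latent; the true latent size is not given in the text.
dim = 6

# A posterior identical to the prior incurs no penalty.
kl_zero = kl_to_standard_normal([0.0] * dim, [0.0] * dim)

# Shifting one mean component by 1 adds 0.5 * 1^2 to the penalty.
kl_shifted = kl_to_standard_normal([1.0] + [0.0] * (dim - 1), [0.0] * dim)

# Draw one latent sample with a fixed seed for reproducibility.
z_pos = sample_latent([0.0] * dim, [0.0] * dim, random.Random(0))
```

In a training loop, a term of this kind would typically be added to the reprojection loss with a weighting factor. That weight is a hyperparameter whose value is not given here.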
