EndoGMDE: Generalizable Monocular Depth Estimation with Mixture of Low-Rank Experts for Diverse Endoscopic Scenes

Liangjing Shao, Chenkang Du, Benshuang Chen, Xueli Liu, Xinrong Chen

TL;DR

EndoGMDE tackles monocular depth estimation in endoscopy under diverse illumination and tissue features by unifying a self-supervised learning strategy with a block-wise mixture of dynamic low-rank experts (BW-MoLE) for parameter-efficient foundation-model finetuning. It introduces an intrinsic-image decomposition–driven training pipeline to address brightness inconsistency and reflectance interference, and a staged optimization that includes optical-flow registration, intrinsic decomposition, and depth-map fine-tuning. The approach achieves state-of-the-art results on realistic datasets (e.g., SCARED, SimCol) and zero-shot generalization on Hamlyn, SERV-CT, and C3VD, with validated 3D reconstruction and ego-motion estimation, highlighting robust endoscopic 3D perception for clinical use. While offering strong performance, it notes higher training costs and modest real-time speed, pointing to future improvements in efficiency and multimodal integration.

Abstract

Self-supervised monocular depth estimation is a significant task for low-cost and efficient 3D scene perception and measurement in endoscopy. However, the variety of illumination conditions and scene features remains the primary challenge for depth estimation in endoscopic scenes. In this work, a novel self-supervised framework is proposed for monocular depth estimation in diverse endoscopic scenes. Firstly, considering the diverse features in endoscopic scenes with different tissues, a novel block-wise mixture of dynamic low-rank experts is proposed to efficiently finetune the foundation model for endoscopic depth estimation. In the proposed module, low-rank experts are allocated to each block based on that block's generalization, and, based on the input feature, different experts with a small number of trainable parameters are adaptively selected for weighted inference. Moreover, a novel self-supervised training framework is proposed to jointly cope with brightness inconsistency and reflectance interference. The proposed method outperforms state-of-the-art works on the SCARED and SimCol datasets. Furthermore, the proposed network also achieves the best generalization in zero-shot depth estimation on the C3VD, Hamlyn and SERV-CT datasets. The outstanding performance of our model is further demonstrated with 3D reconstruction and ego-motion estimation. The proposed method could contribute to accurate endoscopy for minimally invasive measurement and surgery. The evaluation code will be released upon acceptance, while the demo videos can be found at: https://endo-gmde.netlify.app/.
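
To make the idea of input-dependent selection among low-rank experts concrete, the following is a minimal NumPy sketch of a generic routed mixture of LoRA-style experts attached to one frozen linear layer. It is not the authors' BW-MoLE implementation: the class name `MoLELinear`, the top-k softmax router, the zero initialization of `B`, and all sizes are assumptions made for illustration, and the block-wise allocation of experts described in the abstract is not reproduced here.

```python
import numpy as np


class MoLELinear:
    """Hypothetical sketch: frozen linear layer plus routed low-rank experts."""

    def __init__(self, d_in, d_out, num_experts=4, rank=2, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen pretrained weight (random here, standing in for a foundation model).
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # Trainable low-rank factors per expert: delta_k = B_k @ A_k.
        self.A = rng.standard_normal((num_experts, rank, d_in)) * 0.01
        # Assumed zero init, so the layer initially matches the frozen one.
        self.B = np.zeros((num_experts, d_out, rank))
        # Trainable router that scores each expert from the input feature.
        self.router = rng.standard_normal((num_experts, d_in)) * 0.01
        self.top_k = top_k

    def gate(self, x):
        """Return (n, num_experts) weights: softmax over the top-k experts only."""
        logits = x @ self.router.T
        top = np.argsort(logits, axis=-1)[:, -self.top_k:]
        mask = np.full_like(logits, -np.inf)
        np.put_along_axis(mask, top, 0.0, axis=-1)
        masked = logits + mask
        masked = masked - masked.max(axis=-1, keepdims=True)
        weights = np.exp(masked)  # unselected experts get exactly 0
        return weights / weights.sum(axis=-1, keepdims=True)

    def __call__(self, x):
        base = x @ self.W.T                               # (n, d_out)
        g = self.gate(x)                                  # (n, K)
        low = np.einsum("krd,nd->nkr", self.A, x)         # (n, K, rank)
        delta = np.einsum("kor,nkr->nko", self.B, low)    # (n, K, d_out)
        return base + np.einsum("nk,nko->no", g, delta)   # weighted expert sum


layer = MoLELinear(d_in=8, d_out=4)
features = np.random.default_rng(1).standard_normal((3, 8))
output = layer(features)
```

In this sketch only `A`, `B`, and `router` would be updated during finetuning, which is what keeps the trainable parameter count small relative to the frozen weight `W`.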

Paper Structure

This paper contains 29 sections, 21 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Challenges (left) and performance comparison (right) for depth estimation in diverse endoscopic scenes. The size of the circle denotes the number of trainable parameters in the corresponding depth estimation network.
  • Figure 2: The overview of this work and the pipeline of the proposed self-supervised training.
  • Figure 3: The framework for intrinsic image decomposition.
  • Figure 4: The proposed parameter-efficient finetuning of the depth map prediction network.
  • Figure 5: Allocation of experts. 'Param. of Power Law' denotes the $\tau$ in Eq. \ref{dist}.
  • ...and 7 more figures