EndoGMDE: Generalizable Monocular Depth Estimation with Mixture of Low-Rank Experts for Diverse Endoscopic Scenes
Liangjing Shao, Chenkang Du, Benshuang Chen, Xueli Liu, Xinrong Chen
TL;DR
EndoGMDE tackles monocular depth estimation in endoscopy under diverse illumination and tissue features by combining a self-supervised learning strategy with a block-wise mixture of dynamic low-rank experts (BW-MoLE) for parameter-efficient finetuning of a foundation model. It introduces a training pipeline driven by intrinsic image decomposition to address brightness inconsistency and reflectance interference, with a staged optimization that covers optical-flow registration, intrinsic decomposition, and depth-map finetuning. The approach achieves state-of-the-art results on SCARED and SimCol and the best zero-shot generalization on Hamlyn, SERV-CT, and C3VD, and is further validated with 3D reconstruction and ego-motion estimation, supporting robust endoscopic 3D perception for clinical use. Despite the strong performance, the paper notes higher training costs and modest real-time speed, pointing to future improvements in efficiency and multimodal integration.
Abstract
Self-supervised monocular depth estimation is an important task for low-cost and efficient 3D scene perception and measurement in endoscopy. However, the variety of illumination conditions and scene features remains the primary challenge for depth estimation in endoscopic scenes. In this work, a novel self-supervised framework is proposed for monocular depth estimation in diverse endoscopic scenes. First, considering the diverse features of endoscopic scenes with different tissues, a novel block-wise mixture of dynamic low-rank experts is proposed to efficiently finetune the foundation model for endoscopic depth estimation. In the proposed module, experts with a small number of trainable parameters are adaptively selected, based on the input feature, for weighted inference; the low-rank experts are allocated to each block according to the generalization of that block. Moreover, a novel self-supervised training framework is proposed to jointly cope with brightness inconsistency and reflectance interference. The proposed method outperforms state-of-the-art works on the SCARED and SimCol datasets. Furthermore, the proposed network achieves the best generalization in zero-shot depth estimation on the C3VD, Hamlyn and SERV-CT datasets. The performance of the model is further demonstrated with 3D reconstruction and ego-motion estimation. The proposed method could contribute to accurate endoscopy for minimally invasive measurement and surgery. The evaluation code will be released upon acceptance, while the demo videos can be found at: https://endo-gmde.netlify.app/.
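
The idea of selecting low-rank experts from the input feature and combining them by weight can be illustrated with a minimal NumPy sketch. This is not the authors' BW-MoLE implementation: the class name, the top-k softmax router, the fixed rank, and all sizes are assumptions chosen for illustration, and the block-wise expert allocation and dynamic rank described in the abstract are not modeled here.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class MixtureOfLowRankExperts:
    """Illustrative layer: a frozen linear map plus routed low-rank updates.

    Hypothetical sketch only; names and hyperparameters are assumptions.
    """

    def __init__(self, d_in, d_out, num_experts=4, rank=2, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for a frozen, pretrained foundation-model weight.
        self.W = rng.normal(0.0, 0.02, (d_in, d_out))
        # Trainable low-rank factors per expert: A projects down, B projects up.
        self.A = rng.normal(0.0, 0.02, (num_experts, d_in, rank))
        # Zero init so the layer initially reproduces the frozen output.
        self.B = np.zeros((num_experts, rank, d_out))
        # Trainable router that scores each expert from the input feature.
        self.G = rng.normal(0.0, 0.02, (d_in, num_experts))
        self.top_k = top_k

    def gate(self, x):
        """Return per-sample expert weights; only top_k experts are non-zero."""
        logits = x @ self.G                                   # (n, E)
        idx = np.argsort(logits, axis=-1)[:, -self.top_k:]    # top_k expert ids
        masked = np.full_like(logits, -np.inf)
        kept = np.take_along_axis(logits, idx, axis=-1)
        np.put_along_axis(masked, idx, kept, axis=-1)
        return softmax(masked)                                # (n, E)

    def __call__(self, x):
        weights = self.gate(x)                                # (n, E)
        low = np.einsum("nd,edr->ner", x, self.A)             # down-projection
        delta = np.einsum("ner,ero->neo", low, self.B)        # up-projection
        # Frozen path plus the gate-weighted sum of expert updates.
        return x @ self.W + np.einsum("ne,neo->no", weights, delta)


layer = MixtureOfLowRankExperts(d_in=8, d_out=6)
features = np.random.default_rng(1).normal(size=(5, 8))
gate_weights = layer.gate(features)
output = layer(features)
```

In this sketch only `A`, `B` and `G` would be updated during finetuning while `W` stays fixed, which is the general sense in which such adapters are parameter-efficient. The zero initialization of `B` is a common LoRA-style convention, used here as an assumption rather than a detail taken from the paper.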
