EndoDepthL: Lightweight Endoscopic Monocular Depth Estimation with CNN-Transformer
Yangke Li
TL;DR
EndoDepthL tackles monocular depth estimation in endoscopy under real-time and reflective-lighting constraints by introducing a lightweight CNN-Transformer encoder–decoder, a reflective confidence boundary mask, and a self-supervised loss framework. The method uses a total loss $L = L_p + \lambda L_s$ with $L_p = \mu L_R$ to drive depth learning while discounting unreliable regions, and employs dilated convolutions with cross-covariance attention to capture global context without large parameter growth. A novel complexity-evaluation metric combining parameter count, FLOPs, and FPS is proposed, and evaluation on the SCARED dataset demonstrates competitive accuracy with substantially reduced model size and faster inference compared to baselines. Ablation confirms the critical role of the reflective mask, particularly for lightweight configurations, supporting practical deployment for real-time endoscopic depth estimation.
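The loss composition above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes $\mu$ is a per-pixel mask weight in $[0, 1]$, $L_R$ is a per-pixel reconstruction error, and the masked error is averaged over all pixels. The function name `total_loss`, its parameters, and the default value of `lam` are hypothetical.

```python
from typing import Sequence


def total_loss(
    recon_error: Sequence[float],
    mask: Sequence[float],
    smooth_loss: float,
    lam: float = 1e-3,
) -> float:
    """Sketch of L = L_p + lam * L_s with L_p = mean(mu * L_R).

    Assumptions (not taken from the source): mu is a per-pixel weight in
    [0, 1], L_R is a per-pixel reconstruction error, and the masked error
    is averaged over all pixels.
    """
    if len(recon_error) != len(mask):
        raise ValueError("recon_error and mask must have the same length")
    if not recon_error:
        raise ValueError("at least one pixel is required")
    # L_p: reconstruction error with masked-out (unreliable) pixels discounted.
    l_p = sum(m * e for m, e in zip(mask, recon_error)) / len(recon_error)
    # Total loss: masked reconstruction term plus weighted smoothness term.
    return l_p + lam * smooth_loss
```

With a mask value of 0 on a reflective pixel, that pixel's large reconstruction error contributes nothing to $L_p$, which is the discounting behaviour described above.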
Abstract
In this study, we address key challenges concerning the accuracy and effectiveness of depth estimation for endoscopic imaging, with a particular emphasis on real-time inference and the impact of light reflections. We propose a novel lightweight solution named EndoDepthL that integrates Convolutional Neural Networks (CNNs) and Transformers to predict multi-scale depth maps. Our approach optimizes the network architecture and incorporates multi-scale dilated convolution and a multi-channel attention mechanism. We also introduce a statistical confidence boundary mask to minimize the impact of reflective areas. To better evaluate the performance of monocular depth estimation in endoscopic imaging, we propose a novel complexity evaluation metric that considers network parameter size, floating-point operations, and inference frames per second. We comprehensively evaluate our proposed method and compare it with existing baseline solutions. The results demonstrate that EndoDepthL maintains depth estimation accuracy while keeping a lightweight structure.
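One way to picture a statistical confidence boundary mask is a per-frame intensity threshold. The sketch below is an assumption, not the rule used by EndoDepthL: it flags a pixel as a likely reflection when its intensity exceeds the frame mean by more than `k` standard deviations. The function name `reflection_mask` and the parameter `k` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def reflection_mask(intensity: Sequence[float], k: float = 2.0) -> List[int]:
    """Return 1 for pixels to keep and 0 for likely specular reflections.

    Assumption (not from the source): the confidence boundary is
    mean + k * std of the frame's pixel intensities.
    """
    if not intensity:
        raise ValueError("at least one pixel is required")
    # Upper confidence boundary computed from the frame's own statistics.
    upper = mean(intensity) + k * pstdev(intensity)
    # Pixels brighter than the boundary are excluded from the loss.
    return [0 if value > upper else 1 for value in intensity]
```

For a flattened frame such as `[0.2] * 9 + [1.0]`, only the single bright pixel falls above the boundary and is masked out; a uniformly lit frame keeps every pixel.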
