EndoDepthL: Lightweight Endoscopic Monocular Depth Estimation with CNN-Transformer

Yangke Li

TL;DR

EndoDepthL tackles monocular depth estimation in endoscopy under real-time and reflective-lighting constraints by introducing a lightweight CNN-Transformer encoder–decoder, a reflective confidence boundary mask, and a self-supervised loss framework. The method uses a total loss $L = L_p + \lambda L_s$ with $L_p = \mu L_R$ to drive depth learning while discounting unreliable regions, and employs dilated convolutions with cross-covariance attention to capture global context without large parameter growth. A novel complexity-evaluation metric combining parameter count, FLOPs, and FPS is proposed, and evaluation on the SCARED dataset demonstrates competitive accuracy with substantially reduced model size and faster inference compared to baselines. Ablation confirms the critical role of the reflective mask, particularly for lightweight configurations, supporting practical deployment for real-time endoscopic depth estimation.
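The loss structure above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes that $\mu$ acts as a per-pixel mask (0 for pixels flagged as reflective, 1 otherwise), that $L_R$ is a simple absolute reconstruction residual, and that $L_s$ is a plain neighbour-difference smoothness term. The function name `total_loss` and the default value of `lam` are hypothetical.

```python
def total_loss(target, reconstructed, depth, mask, lam=0.001):
    """Sketch of L = L_p + lam * L_s with L_p = mask-weighted L_R.

    All inputs are 2D lists of floats with the same shape. The L1 residual,
    the neighbour-difference smoothness term, and the default lam are
    illustrative assumptions, not the paper's exact definitions.
    """
    h, w = len(target), len(target[0])

    # L_p: absolute reconstruction residual, weighted by the mask so that
    # pixels flagged as reflective (mask value 0) contribute nothing.
    weighted = 0.0
    kept = 0.0
    for y in range(h):
        for x in range(w):
            weighted += mask[y][x] * abs(target[y][x] - reconstructed[y][x])
            kept += mask[y][x]
    l_p = weighted / kept if kept > 0 else 0.0

    # L_s: mean absolute difference between horizontally and vertically
    # adjacent depth values.
    diffs = []
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                diffs.append(abs(depth[y][x + 1] - depth[y][x]))
            if y + 1 < h:
                diffs.append(abs(depth[y + 1][x] - depth[y][x]))
    l_s = sum(diffs) / len(diffs) if diffs else 0.0

    return l_p + lam * l_s
```

With a mask entry of 0 at a pixel, even a large reconstruction error there leaves the loss unchanged, which is the sense in which unreliable regions are discounted.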

Abstract

In this study, we address the key challenges concerning the accuracy and effectiveness of depth estimation for endoscopic imaging, with a particular emphasis on real-time inference and the impact of light reflections. We propose a novel lightweight solution named EndoDepthL that integrates Convolutional Neural Networks (CNN) and Transformers to predict multi-scale depth maps. Our approach includes optimizing the network architecture, incorporating multi-scale dilated convolution, and a multi-channel attention mechanism. We also introduce a statistical confidence boundary mask to minimize the impact of reflective areas. To better evaluate the performance of monocular depth estimation in endoscopic imaging, we propose a novel complexity evaluation metric that considers network parameter size, floating-point operations, and inference frames per second. We comprehensively evaluate our proposed method and compare it with existing baseline solutions. The results demonstrate that EndoDepthL ensures depth estimation accuracy with a lightweight structure.
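Of the three quantities the complexity metric considers, inference frames per second is the one that depends on how it is measured. The sketch below shows one common way to time it; it is not the paper's protocol. The helper name `measure_fps`, the warm-up count, and the run count are assumptions, and `infer` stands in for any zero-argument callable that runs a single forward pass. How the paper combines FPS with parameter count and FLOPs is not stated in this summary, so no combined score is computed here.

```python
import time


def measure_fps(infer, warmup=5, runs=50):
    """Estimate frames per second of a zero-argument inference callable."""
    # Warm-up calls are excluded from timing so one-off setup costs
    # (caches, lazy initialisation) do not skew the estimate.
    for _ in range(warmup):
        infer()

    start = time.perf_counter()
    for _ in range(runs):
        infer()
    elapsed = time.perf_counter() - start

    # Guard against a zero reading from a very coarse clock.
    return runs / elapsed if elapsed > 0 else float("inf")
```

For a GPU model, the callable would also need to wait for the device to finish before returning, otherwise the timer only measures how quickly work is queued.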

Paper Structure

This paper contains 20 sections, 16 equations, 4 figures, and 4 tables.

Introduction
Lightweight CNN-Transformer Encoder
Reflective Mask
Complexity Evaluation
Related Work
Self-supervised Mono-Depth Estimation
Endoscopic Image Analysis
Methodology
Self Supervised Loss Function
CNN-Transformer Lightweight Depth Network
Statistical Confidence Boundary Mask
Evaluation and Validation
Experiment Setup
Dataset
Hyperparameters
...and 5 more sections

Figures (4)

Figure 1: Comparison of EndoDepthL with the baseline. Our method effectively deals with the challenges in endoscopy, such as uneven lighting.
Figure 2: Overview of the proposed method. The source and target frames are fed to the pose network, and the target frame is fed to the depth network. Each network extracts its own features: the pose network estimates the transformation from the source to the target frame, and the depth network produces initial depth predictions. The reconstruction error is then reduced using the camera's intrinsic parameters. Finally, a Statistical Confidence Boundary Mask counteracts the effects of light reflection, giving a more precise and stable result.
Figure 3: Depth network. Feature extraction in the encoder is enhanced by an Encoder Block consisting of convolution and attention components. We propose two encoder sizes (efficiency and performance) to meet varied requirements, as detailed in Table I.
Figure 4: Experimental results. We extracted representative frames from two distinct video segments. The frames cover frontal and lateral viewpoints and different degrees of organ exposure: in some the organs are fully visible, while in others they are partially obscured or covered. The performance variant of EndoDepthL produces smoother and more accurate depth estimates.
