Generalizing monocular colonoscopy image depth estimation by uncertainty-based global and local fusion network

Sijia Du; Chengfeng Zhou; Suncheng Xiang; Jianwei Xu; Dahong Qian

TL;DR

A novel approach to estimating depth maps for endoscopy images under complex clinical conditions is proposed. It serves as a foundation for automatic endoscopic navigation and other clinical tasks, such as polyp detection and segmentation.

Abstract

Objective: Depth estimation is crucial for endoscopic navigation and manipulation, but obtaining ground-truth depth maps in real clinical scenarios, such as the colon, is challenging. This study aims to develop a robust framework that generalizes well to real colonoscopy images, overcoming challenges such as non-Lambertian surface reflection and diverse data distributions.

Methods: We propose a framework that combines a convolutional neural network (CNN) for capturing local features with a Transformer for capturing global information. An uncertainty-based fusion block is designed to enhance generalization by identifying the complementary contributions of the CNN and Transformer branches. The network can be trained on simulated datasets and generalizes directly to unseen clinical data without any fine-tuning.

Results: The method is validated on multiple datasets and generalizes well across datasets and anatomical structures. Furthermore, qualitative analysis in real clinical scenarios confirms the robustness of the proposed method.

Conclusion: The integration of local and global features through the CNN-Transformer architecture, together with the uncertainty-based fusion block, improves depth estimation performance and generalization in both simulated and real-world endoscopic environments.

Significance: This study offers a novel approach to estimating depth maps for endoscopy images under complex clinical conditions, serving as a foundation for automatic endoscopic navigation and other clinical tasks, such as polyp detection and segmentation.

Paper Structure (21 sections, 12 equations, 8 figures, 3 tables)

Introduction
Related Work
Monocular Depth Estimation for Endoscopic Images
Fusion of CNN and Transformer
Uncertainty Estimation for Monocular Depth Estimation
Methods
Local Branch and Global Branch
Uncertainty-based Fusion Module
Uncertainty Estimation for Depth Predictions
Global and Local Branch Fusion with Uncertainty Map
Loss Function
Data Augmentation
Experiments and Results
Implementation Details
Ablation Study
...and 6 more sections

Figures (8)

Figure 1: Visualization of the estimated depth maps, uncertainty maps, and error maps of the CNN and the Transformer. Brighter regions in the heatmaps indicate larger values. The arrows indicate regions where each model predicts poorly. The CNN struggles with reflective regions, while the Transformer struggles with structures such as edges. These weaknesses are complementary, and each is visible in the corresponding uncertainty map.
Figure 2: The structure of the proposed colonoscopy depth estimation network. It consists of a local branch, a global branch, and an uncertainty-based fusion module. Local information is extracted using a CNN, while global features are emphasized by a Transformer. Finally, the uncertainty-based fusion module refines the predictions from the two branches using uncertainty maps.
Figure 3: The topology of the proposed uncertainty-based fusion module. Layers in global and local branches pass through convolutions for uncertainty estimation. The obtained uncertainty maps are transformed into confidence maps and passed through a Softmax layer to weigh the respective depth maps. The weighted results of the branches are then combined for the final output.
Figure 4: The intermediate outputs of the network. The brighter the color, the larger the uncertainty, confidence, or depth value. The arrows highlight regions where the predicted depth maps are refined according to the uncertainty maps: yellow arrows mark corrections based on the CNN, and blue arrows mark corrections based on the Transformer.
Figure 5: Qualitative comparison of different methods on the EndoSLAM dataset with different anatomical structures. Black boxes highlight regions where our method achieves better predictions. Our method can reconstruct more detailed depth maps without being affected by reflective regions.
...and 3 more figures
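
The fusion step described in the Figure 3 caption (uncertainty maps turned into confidence maps, a Softmax across the two branches, then a weighted sum of the depth maps) can be illustrated with a minimal per-pixel sketch. This is not the authors' implementation. The function name `fuse_depths` is hypothetical, the maps are plain nested lists rather than network tensors, and the mapping from uncertainty to confidence (here simply the negated uncertainty) is an assumption, since the caption does not specify it.

```python
import math


def fuse_depths(depth_local, depth_global, unc_local, unc_global):
    """Fuse two depth maps pixel by pixel with softmax confidence weights.

    All four arguments are 2D lists (rows of floats) of identical shape.
    ASSUMPTION: confidence = -uncertainty; the source does not state the
    actual uncertainty-to-confidence transform.
    """
    fused = []
    rows = zip(depth_local, depth_global, unc_local, unc_global)
    for d_l_row, d_g_row, u_l_row, u_g_row in rows:
        fused_row = []
        for d_l, d_g, u_l, u_g in zip(d_l_row, d_g_row, u_l_row, u_g_row):
            # Hypothetical confidence: lower uncertainty means higher confidence.
            c_l, c_g = -u_l, -u_g
            # Two-way softmax over the branches (max subtracted for stability).
            m = max(c_l, c_g)
            e_l, e_g = math.exp(c_l - m), math.exp(c_g - m)
            w_l = e_l / (e_l + e_g)
            w_g = 1.0 - w_l
            # Weighted combination of the local (CNN) and global (Transformer) depths.
            fused_row.append(w_l * d_l + w_g * d_g)
        fused.append(fused_row)
    return fused
```

Under this assumption, a pixel where both branches are equally uncertain receives the average of the two depths, while a pixel where one branch is far more uncertain takes its value almost entirely from the other branch.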
