EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams
Hao Li, Daiwei Lu, Jiacheng Wang, Robert J. Webster, Ipek Oguz
TL;DR
The paper tackles real-time monocular depth estimation for endoscopic video, where prior approaches rely either on batched frames or on heavy diffusion models. It introduces EndoStreamDepth, which combines an endoscopy-robust single-frame depth backbone with hierarchical, multi-level temporal Mamba modules, endoscopy-specific transformation (EST) augmentation, and multi-term supervision to produce temporally stable, sharp depth maps at real-time throughput. Key contributions include EST for endoscopy, a streaming video depth framework with multi-level temporal modeling and self-supervised regularization, and comprehensive evaluations on C3VD and SimCol3D showing improved global geometry, boundary sharpness, and temporal consistency. The approach shows potential to support downstream robotic and image-guided endoscopic automation with robust, streaming depth estimates in challenging near-field scenes.
Abstract
This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with an endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance and produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery. The code is publicly available at https://github.com/MedICL-VU/EndoStreamDepth
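The streaming pattern described above, in which each frame is processed individually while a temporal module carries information forward, can be illustrated with a minimal sketch. This is not the authors' implementation: `single_frame_depth`, `temporal_update`, `TemporalState`, and `stream_depth` are hypothetical names, the single-frame network is replaced by an identity placeholder, and a simple exponential moving average stands in for the Mamba temporal modules. The sketch only shows how a per-frame loop with a carried state avoids batching frames.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple

Depth = List[List[float]]  # H x W depth map as nested lists (illustrative only)


@dataclass
class TemporalState:
    """Hypothetical recurrent state carried from one frame to the next."""
    hidden: Optional[Depth] = None


def single_frame_depth(frame: Depth) -> Depth:
    """Placeholder for a single-frame depth network (identity copy here)."""
    return [row[:] for row in frame]


def temporal_update(
    pred: Depth, state: TemporalState, alpha: float = 0.5
) -> Tuple[Depth, TemporalState]:
    """Stand-in for a temporal module: blend the current prediction with the
    carried state using an exponential moving average (NOT a Mamba block)."""
    if state.hidden is None:
        fused = pred  # first frame: nothing to blend with yet
    else:
        fused = [
            [alpha * p + (1.0 - alpha) * h for p, h in zip(p_row, h_row)]
            for p_row, h_row in zip(pred, state.hidden)
        ]
    return fused, TemporalState(hidden=fused)


def stream_depth(frames: Iterable[Depth]) -> Iterator[Depth]:
    """Process frames one at a time, yielding a depth map per frame."""
    state = TemporalState()
    for frame in frames:
        pred = single_frame_depth(frame)
        fused, state = temporal_update(pred, state)
        yield fused


if __name__ == "__main__":
    # Two 1x1 toy "frames"; the second output is smoothed toward the first.
    for depth in stream_depth([[[1.0]], [[2.0]]]):
        print(depth)
```

Because the state is updated once per frame and never revisits earlier frames, memory use is constant in the length of the video, which is the property that makes a streaming design suitable for real-time use.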
