EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams

Hao Li, Daiwei Lu, Jiacheng Wang, Robert J. Webster, Ipek Oguz

TL;DR

The paper tackles real-time monocular depth estimation for endoscopic video, where prior approaches rely either on batched frames or on heavy diffusion models. It introduces EndoStreamDepth, which combines an endoscopy-robust single-frame depth backbone with hierarchical, multi-level temporal Mamba modules, Endoscopy-specific Transformation (EST) augmentation, and multi-term supervision to produce temporally stable, sharp depth maps at real-time throughput. Key contributions include EST, a streaming video depth framework with multi-level temporal modeling and self-supervised regularization, and evaluations on C3VD and SimCol3D showing improved global geometry, boundary sharpness, and temporal consistency. The approach could support downstream robotic and image-guided endoscopic automation with robust, streaming depth estimates in challenging near-field scenes.

Abstract

This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery. The code is publicly available at https://github.com/MedICL-VU/EndoStreamDepth
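To make the idea of multi-scale supervision with complementary loss terms more concrete, here is a minimal NumPy sketch. It is not the authors' loss: the abstract does not name the individual terms, so the mean-absolute-error term (standing in for global geometry), the finite-difference gradient term (standing in for boundary sharpness), the average-pooling pyramid, and the names `downsample2x`, `global_term`, `edge_term`, `multiscale_loss`, `num_scales`, and `edge_weight` are all assumptions chosen for illustration.

```python
import numpy as np


def downsample2x(depth):
    """Average-pool a (H, W) map by a factor of 2 (H and W assumed even)."""
    h, w = depth.shape
    return depth.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def global_term(pred, target):
    """Mean absolute depth error: an assumed stand-in for a global geometry loss."""
    return float(np.mean(np.abs(pred - target)))


def edge_term(pred, target):
    """Mean absolute difference of finite-difference gradients:
    an assumed stand-in for a boundary-sharpness loss."""
    dx = np.abs(np.diff(pred, axis=1) - np.diff(target, axis=1))
    dy = np.abs(np.diff(pred, axis=0) - np.diff(target, axis=0))
    return float(dx.mean() + dy.mean())


def multiscale_loss(pred, target, num_scales=3, edge_weight=0.5):
    """Average the combined loss over a pyramid of progressively coarser maps."""
    total = 0.0
    for _ in range(num_scales):
        total += global_term(pred, target) + edge_weight * edge_term(pred, target)
        # Move to the next, coarser scale (hypothetical pyramid choice).
        pred, target = downsample2x(pred), downsample2x(target)
    return total / num_scales
```

In this sketch a constant depth offset is penalized only by the global term, while a blurred boundary is penalized mainly by the gradient term, which is the sense in which the two terms are complementary.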

Paper Structure

This paper contains 33 sections, 15 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of the EndoStreamDepth framework. (a) Endoscopy-specific Transformation (EST) is applied to model typical variations in endoscopy. (b) The single-frame depth network predicts depth map $\hat{D}_t$ from frame $I_t$. (c) The video stream depth network further incorporates Mamba modules that receive hidden states $H_{t-1}, H_t$ to propagate information across frames, improving depth predictions. Frames are processed sequentially (streaming), not simultaneously.
  • Figure 2: Multi-level temporal Mamba integration within the decoder. For each decoder level $l$, the feature tokens of the current frame $I_t$ are fused with the features from level $l-1$ and passed through a Mamba module that receives the hidden state $H^{(l)}_{t-1}$ as temporal context. The module outputs an updated hidden state $H^{(l)}_{t}$, which is propagated to the next frame $I_{t+1}$. The right panel illustrates a single Mamba module implemented as a stack of Mamba blocks with state-space model (SSM) layers, each maintaining a recurrent hidden state $h_t$ that is updated to $h_{t+1}$ at the next time step. For brevity, we denote these internal SSM states by $h_t$ without block indices. They are distinct from the decoder-level states $H_t^{(l)}$, and each SSM layer passes its own hidden state to the corresponding layer at the next time step. The decoder processing for $I_{t-1}$ and $I_{t+1}$ is identical to that for $I_t$ and is omitted.
  • Figure 3: Qualitative results. From top to bottom: AbsRel error maps, predicted depth maps, and edge maps derived from the predicted depth map, cropped to the dashed line for visualization. The yellow and green arrows indicate the far- and near-range errors. Red arrows highlight a defect on the edge maps.
  • Figure 4: Per-video temporal variance and runtime on the C3VD dataset. For each sequence, bars show the frame-variance score $\sigma$ (left axis) and the inference speed in FPS (right axis) for our method and FlashDepth. Our model has smaller variances than FlashDepth, with the tradeoff of lower FPS.
  • Figure 5:
  • ...and 2 more figures
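The streaming scheme described in the captions of Figures 1 and 2 (one frame at a time, with a separate hidden state $H^{(l)}$ carried across frames at each decoder level) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: `TemporalModule` is a simple diagonal linear recurrence standing in for a Mamba module, fusion with the previous level is plain addition, all levels share one feature dimension, and the names `TemporalModule`, `StreamingDecoder`, and `decay` are hypothetical.

```python
import numpy as np


class TemporalModule:
    """Toy linear recurrence; an assumed stand-in for a Mamba module."""

    def __init__(self, dim, decay=0.9):
        self.dim = dim
        self.decay = decay  # hypothetical fixed decay, not a learned SSM parameter

    def init_state(self):
        return np.zeros(self.dim)

    def step(self, x, h_prev):
        # Blend the previous hidden state with the current input.
        h = self.decay * h_prev + (1.0 - self.decay) * x
        return h, h  # (output features, updated hidden state)


class StreamingDecoder:
    """Keeps one hidden state per decoder level and updates it frame by frame."""

    def __init__(self, num_levels, dim):
        self.modules = [TemporalModule(dim) for _ in range(num_levels)]
        self.states = [m.init_state() for m in self.modules]

    def step(self, level_features):
        """Process the per-level features of ONE frame; returns last-level output."""
        prev_out = np.zeros_like(level_features[0])
        for l, (module, feat) in enumerate(zip(self.modules, level_features)):
            fused = feat + prev_out  # fuse with the output of level l-1
            prev_out, self.states[l] = module.step(fused, self.states[l])
        return prev_out
```

Calling `step` once per incoming frame keeps memory use independent of video length, since only the per-level states are retained between frames rather than a batch of past frames.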