Learnable Patchmatch and Self-Teaching for Multi-Frame Depth Estimation in Monocular Endoscopy

Shuwei Shao, Zhongcai Pei, Weihai Chen, Xingming Wu, Zhong Liu

TL;DR

This work tackles unsupervised monocular depth estimation in endoscopy by exploiting temporal information from multiple frames. It introduces a learnable patchmatch module with adaptive propagation, augmented by cross-teaching and self-teaching regularizations to cope with low-texture tissues and brightness fluctuations typical of endoscopic scenes. The approach uses a three-branch architecture (DepthNet, cross-teaching, self-teaching) and learns flexible depth ranges, achieving state-of-the-art results on SCARED, EndoSLAM, Hamlyn, and SERV-CT, while enabling performance gains as more frames are input at test time. The method generalizes well across datasets and offers practical advantages for surgical navigation and real-time depth-based guidance in endoscopy.

Abstract

This work delves into unsupervised monocular depth estimation in endoscopy, which leverages adjacent frames to establish a supervisory signal during the training phase. For many clinical applications, e.g., surgical navigation, temporally correlated frames are also available at test time. Because depth cues are scarce, making full use of the temporal correlation among multiple video frames at both phases is crucial for accurate depth estimation. However, several challenges in endoscopic scenes, such as low and homogeneous textures and inter-frame brightness fluctuations, limit the performance gain from the temporal correlation. To fully exploit it, we propose a novel unsupervised multi-frame monocular depth estimation model. The proposed model integrates a learnable patchmatch module to adaptively increase the discriminative ability in regions with low and homogeneous textures, and enforces cross-teaching and self-teaching consistencies to provide effective regularization against brightness fluctuations. Furthermore, as a byproduct of the self-teaching paradigm, the proposed model is able to improve its depth predictions when more frames are input at test time. We conduct detailed experiments on multiple datasets, including SCARED, EndoSLAM, Hamlyn and SERV-CT. The experimental results indicate that our model outperforms state-of-the-art competitors. The source code and trained models will be made publicly available upon acceptance.

Paper Structure (31 sections, 14 equations, 14 figures, 9 tables)

Figures (14)

  • Figure 1: Our method, which trains and tests on endoscopic video streams rather than isolated images, produces accurate depth predictions.
  • Figure 2: Main challenges encountered in endoscopic scenes. (a) Low texture. The difference map is acquired by taking an absolute difference between frames $\tau$ and $\tau - 1$, to indicate the low-texture regions. (b) Inter-frame brightness fluctuations.
  • Figure 3: Overview of the whole framework. In the training phase, our framework leverages a target frame and a source frame to build a cost volume, and includes three branches, namely self-teaching, depth estimation and cross-teaching. During the evaluation phase, only the depth estimation branch is used to produce the final results, and more frames can be input to improve the depth predictions.
  • Figure 4: 3D Illustration of the depth planes in plane-sweeping, with each plane perpendicular to the optical axis.
  • Figure 5: Illustration of the keypoints, which are extracted by (a) direct sparse odometry (DSO) [engel2017direct], (b) scale-invariant feature transform (SIFT) [lowe2004distinctive] and (c) Oriented FAST and Rotated BRIEF (ORB) [rublee2011orb], and are marked with green circles. The DSO keypoints are denser and more evenly distributed.
  • ...and 9 more figures
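
The difference map described in the Figure 2 caption (an absolute difference between frames $\tau$ and $\tau - 1$) can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: the function names `difference_map` and `low_texture_mask`, the grayscale input in [0, 1], and the threshold value are all assumptions made for this example.

```python
import numpy as np


def difference_map(frame_t: np.ndarray, frame_prev: np.ndarray) -> np.ndarray:
    """Per-pixel absolute difference between frames tau and tau - 1."""
    if frame_t.shape != frame_prev.shape:
        raise ValueError("frames must have the same shape")
    return np.abs(frame_t.astype(np.float32) - frame_prev.astype(np.float32))


def low_texture_mask(diff: np.ndarray, threshold: float = 0.02) -> np.ndarray:
    """Flag pixels whose inter-frame difference is below a threshold.

    The threshold is a hypothetical value chosen for this toy example.
    """
    return diff < threshold


# Toy frames: a textured left half (an intensity ramp) and a flat right half.
prev = np.full((4, 8), 0.5, dtype=np.float32)
prev[:, :4] = [0.0, 0.25, 0.5, 0.75]

# Simulate camera motion by shifting the textured half by one pixel.
curr = prev.copy()
curr[:, :4] = np.roll(prev[:, :4], 1, axis=1)

diff = difference_map(curr, prev)
mask = low_texture_mask(diff)
```

In this toy case the flat right half yields a zero difference under motion and is flagged by the mask, while the shifted ramp on the left is not. This mirrors why homogeneous tissue offers little signal for matching between frames.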