V$^2$-SfMLearner: Learning Monocular Depth and Ego-motion for Multimodal Wireless Capsule Endoscopy

Long Bai, Beilei Cui, Liangyu Wang, Yanheng Li, Shilong Yao, Sishen Yuan, Yanan Wu, Yang Zhang, Max Q. -H. Meng, Zhen Li, Weiping Ding, Hongliang Ren

TL;DR

V$^2$-SfMLearner tackles depth and ego-motion estimation in monocular wireless capsule endoscopy by fusing vision with vibration signals. It introduces a Fourier-based heterogeneous fusion module and an MLSTM-based vibration branch, enabling unsupervised learning of depth $D_t$ and ego-motion $P_{t,t+1}$ without ground-truth (GT) supervision. The method is more robust to vibration-induced noise than vision-only baselines and outperforms them on two MM-WCE datasets simulated with VR-Caps, MM-WCE-1 and MM-WCE-2. The authors suggest that it could be integrated into capsule robots for real-time clinical screening, with future work on efficiency, Sim2Real generalization, and extension to other endoscopic modalities.

Abstract

Deep learning can predict depth maps and capsule ego-motion from capsule endoscopy videos, aiding in 3D scene reconstruction and lesion localization. However, collisions of the capsule endoscope within the gastrointestinal tract cause vibration perturbations in the training data. Existing solutions focus solely on vision-based processing, neglecting auxiliary signals such as vibrations that could reduce noise and improve performance. Therefore, we propose V$^2$-SfMLearner, a multimodal approach integrating vibration signals into vision-based depth and capsule motion estimation for monocular capsule endoscopy. We construct a multimodal capsule endoscopy dataset containing vibration and visual signals, and develop an unsupervised method using vision-vibration signals that effectively eliminates vibration perturbations through multimodal learning. Specifically, we design a vibration network branch and a Fourier fusion module to detect and mitigate vibration noise. The fusion framework is compatible with popular vision-only algorithms. Extensive validation on the multimodal dataset demonstrates superior performance and robustness compared with vision-only algorithms. Without the need for large external equipment, our V$^2$-SfMLearner has the potential for integration into clinical capsule robots, providing real-time and dependable digestive examination tools. The findings show promise for practical implementation in clinical settings, enhancing the diagnostic capabilities of doctors.

Paper Structure

This paper contains 24 sections, 15 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of the vision-vibration framework, against the conventional vision-only depth and ego-motion estimation solution.
  • Figure 2: The network architecture of the V$^2$-SfMLearner framework. Consecutive unlabeled images ($I_t$, $I_{t+1}$) and vibration signals (${Vib}_t$, ${Vib}_{t+1}$) are fed to the depth estimation network, which predicts dense disparity maps ($D_t$, $D_{t+1}$). Meanwhile, the ego-motion estimation network predicts the relative WCE pose $P_{t,t+1}$. The output of the vibration branch feeds the Fourier heterogeneous (FH) fusion module after each vision encoder block. The predicted depth map is warped using the WCE ego-motion to obtain $D^t_{t+1}$. The geometry consistency loss measures the pixel-wise difference between $D^t_{t+1}$ and the interpolated depth map $D'_{t+1}$. The detailed structures of the MLSTM and the depth encoder are shown on the right.
  • Figure 3: Fourier heterogeneous fusion module. An MLP after the vibration branch estimates the $\mathcal{SNR}$ of the vibration signal. The visual feature map output by each vision encoder block is Fourier transformed, and the $\mathcal{SNR}$ is applied to it in the Fourier domain to remove vibration noise. After the inverse Fourier transform, the visual feature map is passed to the next vision encoder block or to the decoder.
  • Figure 4: Overview of our datasets. Left: original images; Middle: the depth GT; Right Top: vibration signal example; Right Bottom: camera ego-motion example.
  • Figure 5: The qualitative experimental results of depth estimation of our fusion framework, against vision-only baselines EndoSfMLearner, AF-SfMLearner, RA-Depth, and EndoDAC. The error heat maps are calculated with the normalized difference between the GT and the predicted depth map. Blue represents low error, and red represents high error.
  • ...and 3 more figures
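
The Figure 3 caption describes a three-step flow: Fourier transform the visual feature map, apply the vibration-derived $\mathcal{SNR}$ in the Fourier domain, then transform back. The NumPy sketch below illustrates that flow only. The captions do not say how the $\mathcal{SNR}$ enters the spectrum, so the gating rule used here (a low-pass gain that attenuates high frequencies more strongly when the SNR is low) is an assumption made for illustration, and the function name `fourier_snr_fusion` and its parameters are hypothetical rather than taken from the paper.

```python
import numpy as np


def fourier_snr_fusion(feat, snr, eps=1e-8):
    """Gate a visual feature map in the Fourier domain with an SNR estimate.

    feat: real array of shape (C, H, W), one vision encoder block output.
    snr:  non-negative scalar or array of shape (C,), standing in for the
          value the MLP would produce from the vibration branch.

    ASSUMPTION: the gain 1 / (1 + r^2 / snr) is an illustrative choice,
    not the rule used by V^2-SfMLearner.
    """
    feat = np.asarray(feat, dtype=np.float64)
    c, h, w = feat.shape
    snr = np.broadcast_to(np.asarray(snr, dtype=np.float64), (c,)).reshape(c, 1, 1)

    # Normalised radial frequency of every FFT bin (0 at DC, 1 at the maximum).
    fy = np.fft.fftfreq(h).reshape(h, 1)
    fx = np.fft.fftfreq(w).reshape(1, w)
    radius = np.sqrt(fy ** 2 + fx ** 2)
    radius = radius / max(radius.max(), eps)

    # High SNR -> gain close to 1 everywhere; low SNR -> high frequencies damped.
    gain = 1.0 / (1.0 + radius ** 2 / (snr + eps))

    spectrum = np.fft.fft2(feat, axes=(-2, -1))              # to Fourier domain
    fused = np.fft.ifft2(spectrum * gain, axes=(-2, -1))     # back to spatial domain
    return fused.real  # gain is symmetric, so the imaginary part is numerical noise


# Usage: a random map stands in for encoder features, 0.05 for a noisy-signal SNR.
features = np.random.default_rng(0).standard_normal((4, 16, 16))
fused_features = fourier_snr_fusion(features, snr=0.05)
```

In a trainable network the same steps would be written with a differentiable FFT so that gradients reach the MLP that predicts the SNR; that part is omitted here.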