BodySLAM: A Generalized Monocular Visual SLAM Framework for Surgical Applications

G. Manni, C. Lauretti, F. Prata, R. Papalia, L. Zollo, P. Soda

TL;DR

BodySLAM is a robust deep learning-based MVSLAM approach that addresses challenges in endoscopic procedures through three key components: CycleVO, a novel unsupervised monocular pose estimation module; the integration of the state-of-the-art Zoe architecture for monocular depth estimation; and a 3D reconstruction module that creates a coherent surgical map.

Abstract

Endoscopic surgery relies on two-dimensional views, posing challenges for surgeons in depth perception and instrument manipulation. While Monocular Visual Simultaneous Localization and Mapping (MVSLAM) has emerged as a promising solution, its implementation in endoscopic procedures faces significant challenges due to hardware limitations, such as the use of a monocular camera and the absence of odometry sensors. This study presents BodySLAM, a robust deep learning-based MVSLAM approach that addresses these challenges through three key components: CycleVO, a novel unsupervised monocular pose estimation module; the integration of the state-of-the-art Zoe architecture for monocular depth estimation; and a 3D reconstruction module creating a coherent surgical map. The approach is rigorously evaluated using three publicly available datasets (Hamlyn, EndoSLAM, and SCARED) spanning laparoscopy, gastroscopy, and colonoscopy scenarios, and benchmarked against four state-of-the-art methods. Results demonstrate that CycleVO exhibited competitive performance with the lowest inference time among pose estimation methods, while maintaining robust generalization capabilities, whereas Zoe significantly outperformed existing algorithms for depth estimation in endoscopy. BodySLAM's strong performance across diverse endoscopic scenarios demonstrates its potential as a viable MVSLAM solution for endoscopic applications.

Paper Structure (19 sections, 8 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Comparison of (A) a classical MVSLAM system and (B) a fully deep learning-based MVSLAM framework. The traditional approach consists of feature extraction, feature matching, motion estimation, depth map estimation, map building, and optimization components. In contrast, the deep learning-based framework replaces conventional algorithms with deep learning models for feature extraction, feature matching, motion estimation, and depth estimation, potentially simplifying the overall architecture complexity.
  • Figure 2: The approach takes RGB frames as input and outputs a 3D reconstruction of the surgical scene. The MPEM, utilizing CycleVO, computes relative motion between consecutive frames, outputting a motion matrix $M = [R, t_{unscaled}, 1, 0]$, where $R$ is the rotation matrix and $t_{unscaled}$ is the unscaled translation vector. The MDEM estimates depth maps from RGB frames. The Translation Estimation Module estimates a scaled translation vector $t_{scaled}$, which is then combined with $t_{unscaled}$ using an Unscented Kalman Filter to correct the scale of the motion matrix. Finally, the 3DM combines the RGB frames, depth maps, and scaled pose matrices to generate and update the 3D model.
  • Figure 3: Integration of Pose Estimation within the Cycle Consistency Framework and the Neural Network Architecture. Left: The diagram illustrates the integration of pose estimation within the Cycle Consistency framework. Generators $\text{Gen}_{AB}$ and $\text{Gen}_{BA}$ perform domain transformations, while pose networks $P_{AB}$ and $P_{BA}$ predict the relative pose between consecutive frames. The predicted pose $M$ is concatenated with the latent space $z$ to improve image-to-image translation performance. Right: The neural network architecture for pose estimation includes convolutional, downsampling, residual, and upsampling layers. The bottleneck is modified to accommodate pose estimation, where the generator encoder $E$ processes concatenated frames $f_c = [f_{i-1}, f_i]$ and the pose estimation tail $P$ produces the relative motion matrix $M = [R, t_{unscaled}, 1, 0]$, where $R$ is the rotation matrix and $t_{unscaled}$ is the unscaled translation vector. The latent space $z$ and predicted pose $M$ are concatenated and fed into the generator $G$ to produce the next frame $\hat{f}_i$.
  • Figure 4: Comparison of the performance of EndoDepth, EndoSfmLearner, AFSfMlearner, OneSLAM and CycleVO algorithms on the SCARED and EndoSLAM datasets. Metrics evaluated include Absolute Trajectory Error (ATE), Relative Trajectory Error (RTE), and Relative Rotation Error (RRE). Each box plot shows the distribution of the errors for the respective algorithms, highlighting the median, interquartile range, and outliers.
  • Figure 5: Ablation study of cycle loss term modifications. This figure presents the impact of different cycle loss term settings ($\lambda_2 \neq 0$ vs $\lambda_2 = 0$) on the performance of models evaluated on the SCARED and EndoSLAM datasets. The metrics assessed include Absolute Trajectory Error (ATE), Relative Trajectory Error (RTE), and Relative Rotation Error (RRE). Each box plot illustrates the distribution of the errors for the respective models, highlighting the median, interquartile range, and outliers.
  • ...and 2 more figures
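
The motion matrix and the error metrics named in the captions above can be illustrated with a short sketch. It assumes that $M = [R, t_{unscaled}, 1, 0]$ denotes the usual 4x4 homogeneous transform, with $R$ in the top-left block and the translation in the last column. All function names here (`motion_matrix`, `rescale_translation`, `accumulate`, `rotation_error_deg`, `position_rmse`) are hypothetical and are not taken from the BodySLAM code. A single scalar scale factor stands in for the Unscented Kalman Filter fusion, and the position error is a plain RMSE without trajectory alignment, which may differ from the ATE definition used in the paper.

```python
import numpy as np


def motion_matrix(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 homogeneous transform."""
    M = np.eye(4)
    M[:3, :3] = np.asarray(R, dtype=float)
    M[:3, 3] = np.asarray(t, dtype=float)
    return M


def rescale_translation(M, scale):
    """Return a copy of M whose translation is multiplied by `scale`.

    Assumption: a scalar factor replaces the UKF-based scale correction.
    """
    out = M.copy()
    out[:3, 3] *= scale
    return out


def accumulate(relative_motions):
    """Chain frame-to-frame motions into absolute poses, starting at the identity."""
    pose = np.eye(4)
    trajectory = [pose.copy()]
    for M in relative_motions:
        pose = pose @ M
        trajectory.append(pose.copy())
    return trajectory


def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the rotation that takes R_est to R_gt."""
    R_delta = np.asarray(R_est).T @ np.asarray(R_gt)
    cos_angle = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))


def position_rmse(est_poses, gt_poses):
    """Root-mean-square distance between camera positions (no alignment step)."""
    est_xyz = np.array([T[:3, 3] for T in est_poses])
    gt_xyz = np.array([T[:3, 3] for T in gt_poses])
    return float(np.sqrt(np.mean(np.sum((est_xyz - gt_xyz) ** 2, axis=1))))


if __name__ == "__main__":
    # Toy motion: rotate 90 degrees about z and move one unit along x, twice.
    Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    step = motion_matrix(Rz90, [1.0, 0.0, 0.0])
    gt = accumulate([step, step])
    # Same rotations, but translations that are off by a factor of two.
    est = accumulate([rescale_translation(step, 0.5)] * 2)
    print("position RMSE:", position_rmse(est, gt))
    print("rotation error (deg):", rotation_error_deg(est[-1][:3, :3], gt[-1][:3, :3]))
```

In this toy run the estimated trajectory has the correct rotations but a wrong translation scale, so the rotation error is zero while the position error is not. This is the kind of scale ambiguity that the scaled translation estimate in Figure 2 is meant to correct.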