EndoSERV: A Vision-based Endoluminal Robot Navigation System

Junyang Wu; Fangfang Xie; Minghui Zhang; Hanxiao Zhang; Jiayuan Sun; Yun Gu; Guang-Zhong Yang

TL;DR

EndoSERV is a localization method for long-range, complex luminal structures. It divides the structure into smaller sub-segments and estimates the odometry of each independently, and it uses an efficient transfer technique to map real image features to the virtual domain so that virtual pose ground truth can be used.

Abstract

Robot-assisted endoluminal procedures are increasingly used for early cancer intervention. However, the intricate, narrow and tortuous pathways within the luminal anatomy pose substantial difficulties for robot navigation. Vision-based navigation offers a promising solution, but existing localization approaches are error-prone due to tissue deformation, in vivo artifacts and a lack of distinctive landmarks for consistent localization. This paper presents a novel EndoSERV localization method to address these challenges. It includes two main parts, i.e., SEgment-to-structure and Real-to-Virtual mapping, and hence the name. For long-range and complex luminal structures, we divide them into smaller sub-segments and estimate the odometry independently. To cater for label insufficiency, an efficient transfer technique maps real image features to the virtual domain to use virtual pose ground truth. The training phases of EndoSERV include an offline pretraining to extract texture-agnostic features, and an online phase that adapts to real-world conditions. Extensive experiments based on both public and clinical datasets have been performed to demonstrate the effectiveness of the method even without any real pose labels.
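
The segment-wise idea in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that a frame sequence is cut into consecutive sub-segments, each with a training buffer followed by a testing buffer. The names `SubSegment` and `sliding_subsegments`, the fixed buffer lengths, and the non-overlapping stepping rule are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubSegment:
    """One sub-segment of the frame sequence (hypothetical structure)."""
    train: range  # frame indices assumed to be used for online adaptation
    test: range   # frame indices whose poses would then be estimated


def sliding_subsegments(num_frames: int, train_len: int, test_len: int) -> List[SubSegment]:
    """Split frame indices 0..num_frames-1 into alternating train/test buffers.

    Assumption: windows do not overlap, and trailing frames that cannot
    form a full training buffer plus at least one test frame are dropped.
    """
    if train_len <= 0 or test_len <= 0:
        raise ValueError("train_len and test_len must be positive")

    segments: List[SubSegment] = []
    start = 0
    # Keep sliding while a full training buffer and at least one test frame remain.
    while start + train_len < num_frames:
        train_end = start + train_len
        test_end = min(train_end + test_len, num_frames)
        segments.append(SubSegment(range(start, train_end), range(train_end, test_end)))
        start = test_end  # the next window begins where this test buffer ended
    return segments


if __name__ == "__main__":
    for seg in sliding_subsegments(num_frames=10, train_len=3, test_len=2):
        print(list(seg.train), list(seg.test))
```

In this sketch each sub-segment would be processed independently, which mirrors the divide-and-conquer description; how the actual system sizes, overlaps, or schedules its buffers is not stated in this summary.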

Paper Structure (38 sections, 11 equations, 10 figures, 5 tables)


Introduction
Related work
SfM-based method
Neural Implicit-based SLAM
Real-Virtual Alignment
System overview
Sliding Window Buffering
Subsegment system pipeline
Offline Training
Online Training
Testing Phase
Training pipeline
Offline Training: Extracting Robust Features
Online Training: Adaptation to Real-World Scenarios
Virtual Buffer Retrieval
...and 23 more sections

Figures (10)

Figure 1: (a) Scale ambiguity in monocular SLAM: during the testing phase, the results of monocular SLAM must be aligned with ground-truth trajectories, which is impractical in clinical applications due to the lack of absolute scale information. (b) Appearance similarity in endoscopic images: different bronchial branches often exhibit nearly identical geometric and topological structures, which can lead to ambiguous feature matching and incorrect associations with virtual frames.
Figure 2: Motivations of this work. Segment-to-Structure: because the luminal path is long and structurally complex, a divide-and-conquer strategy is proposed for long-term pose estimation. Real-to-Virtual: because real pose labels are lacking in clinical scenarios, pre-operative CT data are used as the structural prior for intra-operative odometry estimation.
Figure 3: System overview of EndoSERV. (a) A sliding-window strategy for long-term pose estimation. Black windows denote the training images, while pink windows represent the testing images. The sliding windows move along the temporal axis, enabling the system to alternate between training buffers and testing buffers. (b) The detailed pipeline within a sub-segment, consisting of offline training, online training, and a testing phase.
Figure 4: Training pipeline overview. (a) Offline training pipeline. Virtual images are augmented using the pretrained diffusion model, generating texture-diverse augmented images. A novel aligner is designed to constrain the feature encoder to map features into a unified feature space. A scene coordinate head is designed to generate the scene coordinate map. (b) All virtual images are first compressed into the virtual buffer during a retrieval process, which is used to fine-tune the transfer model quickly with the real buffer. An augmentation-then-recovery strategy is proposed to address distortion and deformation. After everything is aligned to the virtual domain, a scene coordinate head is trained to estimate the camera pose.
Figure 5: DDAug framework. The real image is generated from the virtual image using the pretrained transfer model. Three augmentations are applied: color jitter, mixup with the noisy image, and camera parameter perturbation.
...and 5 more figures
