BASED: Bundle-Adjusting Surgical Endoscopic Dynamic Video Reconstruction using Neural Radiance Fields

Shreya Saha, Zekai Liang, Shan Lin, Jingpei Lu, Michael Yip, Sainan Liu

TL;DR

This work tackles deformable surgical scene reconstruction from monocular endoscopic videos with unknown camera poses. It introduces BASED, a three-part NeRF-based framework comprising a learnable camera pose module, a deformation module, and a canonical NeRF, augmented by tool-mask guided ray casting and depth-guided losses. The approach jointly estimates camera motion, nonrigid tissue deformation, and a canonical 3D representation, and it employs a dynamic multi-view correspondence loss and depth guidance to improve pose and geometry in challenging endoscopic data. Across Hamlyn and EndoNeRF datasets, BASED delivers superior rendering quality and more accurate depth maps than state-of-the-art baselines, demonstrating strong potential for intraoperative navigation and autonomous robotic perception in deformable surgical scenes.

Abstract

Reconstruction of deformable scenes from endoscopic videos is important for many applications such as intraoperative navigation, surgical visual perception, and robotic surgery. It is a foundational requirement for realizing autonomous robotic interventions for minimally invasive surgery. However, previous approaches in this domain have been limited by their modular nature and are confined to specific camera and scene settings. Our work adopts the Neural Radiance Fields (NeRF) approach to learning 3D implicit representations of scenes that are both dynamic and deformable over time, and furthermore with unknown camera poses. We demonstrate this approach on endoscopic surgical scenes from robotic surgery. This work removes the constraints of known camera poses and overcomes the drawbacks of the state-of-the-art unstructured dynamic scene reconstruction technique, which relies on the static part of the scene for accurate reconstruction. Through several experimental datasets, we demonstrate the versatility of our proposed model to adapt to diverse camera and scene settings, and show its promise for both current and future robotic surgical systems.

Paper Structure (21 sections, 9 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: BASED is a novel NeRF-based method that can be used in dynamic and deformable scenes with unknown camera poses. It can produce novel viewpoint renderings with robust color (left) and depth (right) reconstructions, even from monocular untracked camera images. Comparisons with state-of-the-art methods show that it has leading performance in scene reconstruction. Depth in the ground truth row is the reference depth, estimated from recasens2021endo.
  • Figure 2: Method overview: (A) The first part of the diagram shows a tissue that appears dynamic and deformable at three timestamps $t_0$, $t_1$, and $t_2$. Camera poses are estimated by an $N \times 9$ learnable matrix through the camera pose estimation layer. $x(t_0)$, $x(t_1)$, and $x(t_2)$ show the trajectory taken by the same point $x$ on the tissue at these three timestamps. (B) $x_{t_1}$ and $x_{t_2}$ represent the non-rigid deformation trajectory of $x$ at times $t_1$ and $t_2$. The deformation model, denoted by $\psi_{\Delta}$, takes in $x_{t_1}$ and $x_{t_2}$ and predicts their displacements from the canonical configuration of the scene. The canonical model $\psi_x$ then takes in the mapped canonical position of the point, $x_{t_0} = x_{t_1} + \Delta x_{t_1} = x_{t_2} + \Delta x_{t_2}$, along with the 2D camera directions, and predicts the color and density information.
  • Figure 3: Overview of the correspondence loss: Matching pixels are chosen from a random pair of images (having correspondence above a certain threshold) using PDC-Net truong2021learning. The matching pixels are projected from 2D pixel space to their 3D positions in the world frame and then passed through the deformation model to obtain their canonical 3D positions. The difference between the mapped 3D points is used to calculate the flow correspondence loss. The loss is designed to pull the deformation model's outputs for corresponding points closer together, towards the same canonical point, denoted $x_0$.
  • Figure 4: Qualitative analysis of our proposed model BASED and EndoNeRF on the "Cutting Tissues Twice" dataset for novel view renderings. Highlighted sections show how BASED avoids different artifacts in the images, resulting in sharper renderings.
  • Figure 5: Ablation analysis on the Hamlyn Rectified 18-1 dataset shows how the different losses improve the rendered results of the final model. Note that the depth in the Ground Truth column is the reference depth.
  • ...and 3 more figures
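
The canonical mapping in the Figure 2 caption and the correspondence loss in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: `toy_deformation` is a hypothetical, hand-written stand-in for the learned deformation model $\psi_{\Delta}$, and the mean squared distance used in `correspondence_loss` is an assumed form, since the captions only say that the loss is computed from the difference between the mapped 3D points. The function names and the toy scene motion are illustrative only.

```python
import numpy as np


def toy_deformation(x, t):
    """Hypothetical stand-in for the learned deformation model psi_Delta.

    Returns the displacement that maps points observed at time t back to
    the canonical configuration. In this toy scene the tissue drifts along
    +z by 0.1 * t, so the displacement is -0.1 * t along z.
    """
    dx = np.zeros_like(x)
    dx[..., 2] = -0.1 * t
    return dx


def to_canonical(x, t, deform=toy_deformation):
    """Map observed points to canonical space: x_canonical = x_t + Delta x_t."""
    return x + deform(x, t)


def correspondence_loss(x_a, t_a, x_b, t_b, deform=toy_deformation):
    """Mean squared distance between the canonical positions of matched points.

    x_a and x_b are (M, 3) arrays of matched 3D points in the world frame,
    observed at times t_a and t_b. The squared-L2 form is an assumption.
    """
    c_a = to_canonical(x_a, t_a, deform)
    c_b = to_canonical(x_b, t_b, deform)
    return float(np.mean(np.sum((c_a - c_b) ** 2, axis=-1)))


# Two canonical points and their observed positions at t = 1 and t = 2.
x0 = np.array([[0.0, 0.0, 1.0], [0.2, -0.1, 1.5]])
x_t1 = x0 + np.array([0.0, 0.0, 0.1 * 1.0])
x_t2 = x0 + np.array([0.0, 0.0, 0.1 * 2.0])

# Consistent deformation: both observations map to the same canonical point.
loss_consistent = correspondence_loss(x_t1, 1.0, x_t2, 2.0)

# Inconsistent deformation (wrong timestamp for the second view): the
# canonical positions disagree, so the loss is positive.
loss_inconsistent = correspondence_loss(x_t1, 1.0, x_t2, 1.0)
```

In this toy setup the loss is essentially zero when the deformation correctly undoes the motion for both views, and positive otherwise, which is the behavior the Figure 3 caption describes as pulling matched points towards the same canonical point $x_0$.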