BASED: Bundle-Adjusting Surgical Endoscopic Dynamic Video Reconstruction using Neural Radiance Fields
Shreya Saha, Zekai Liang, Shan Lin, Jingpei Lu, Michael Yip, Sainan Liu
TL;DR
This work tackles deformable surgical scene reconstruction from monocular endoscopic videos with unknown camera poses. It introduces BASED, a NeRF-based framework with three components: a learnable camera pose module, a deformation module, and a canonical NeRF, augmented by tool-mask-guided ray casting and depth-guided losses. The approach jointly estimates camera motion, non-rigid tissue deformation, and a canonical 3D representation, using a dynamic multi-view correspondence loss and depth guidance to improve pose and geometry estimates on challenging endoscopic data. On the Hamlyn and EndoNeRF datasets, BASED delivers higher rendering quality and more accurate depth maps than state-of-the-art baselines, showing potential for intraoperative navigation and autonomous robotic perception in deformable surgical scenes.
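To make the three-part structure concrete, the following is a minimal NumPy sketch of how a per-frame camera pose, a deformation module, and a canonical field could be chained to render one ray. It is an illustration under stated assumptions, not the authors' implementation: the function names (`rodrigues`, `deform`, `canonical_field`, `sample_tissue_pixels`, `render_ray`), the axis-angle pose parameterization, the closed-form stand-ins for the learned networks, and the near/far bounds are all hypothetical. In the actual method the pose parameters and both networks would be learned jointly, and the correspondence and depth losses are not shown here.

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert an axis-angle vector (3,) to a 3x3 rotation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def deform(points, t):
    """Stand-in for the deformation module (assumption: a learned MLP in
    practice). Maps points observed at time t to canonical coordinates."""
    offset = 0.05 * np.sin(2 * np.pi * t) * np.array([0.0, 0.0, 1.0])
    return points + offset

def canonical_field(points):
    """Stand-in for the canonical NeRF (assumption: a learned MLP in
    practice). Returns density (N,) and RGB in [0, 1] with shape (N, 3)."""
    center = np.array([0.0, 0.0, 2.0])
    density = 5.0 * np.exp(-np.sum((points - center) ** 2, axis=-1))
    rgb = 0.5 + 0.5 * np.tanh(points)
    return density, rgb

def sample_tissue_pixels(tool_mask, n, rng):
    """Pick n pixel coordinates outside the tool mask (True = tool), so that
    rays are cast only through tissue. Uniform sampling is an assumption."""
    rows, cols = np.nonzero(~tool_mask)
    idx = rng.choice(len(rows), size=n, replace=False)
    return np.stack([rows[idx], cols[idx]], axis=-1)

def render_ray(pose_params, pixel_dir, t, near=0.5, far=4.0, n_samples=64):
    """Render one ray at time t. pose_params = [axis-angle (3), translation (3)]
    would be learnable per frame. Returns (rgb, expected depth)."""
    R = rodrigues(pose_params[:3])
    origin = pose_params[3:]
    direction = R @ (pixel_dir / np.linalg.norm(pixel_dir))
    z = np.linspace(near, far, n_samples)
    pts = origin + z[:, None] * direction            # samples along the ray
    density, rgb = canonical_field(deform(pts, t))   # warp, then query
    delta = np.append(np.diff(z), 1e10)              # last interval is open
    alpha = 1.0 - np.exp(-density * delta)
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))
    weights = alpha * trans                          # volume rendering weights
    return weights @ rgb, weights @ z

rgb, depth = render_ray(np.zeros(6), np.array([0.0, 0.0, 1.0]), t=0.25)
```

Because the rendered color and depth are differentiable functions of the pose parameters as well as of the two fields, a photometric or depth loss on `rgb` and `depth` can in principle drive all three components at once, which is the bundle-adjusting idea the summary describes.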
Abstract
Reconstruction of deformable scenes from endoscopic videos is important for many applications, such as intraoperative navigation, surgical visual perception, and robotic surgery. It is a foundational requirement for realizing autonomous robotic interventions in minimally invasive surgery. However, previous approaches in this domain have been limited by their modular nature and confined to specific camera and scene settings. Our work adopts the Neural Radiance Fields (NeRF) approach to learn 3D implicit representations of scenes that are dynamic and deformable over time and whose camera poses are unknown. We demonstrate this approach on endoscopic scenes from robotic surgery. This work removes the constraint of known camera poses and overcomes the drawbacks of the state-of-the-art unstructured dynamic scene reconstruction technique, which relies on the static part of the scene for accurate reconstruction. On several datasets, we demonstrate that the proposed model adapts to diverse camera and scene settings, and we show its promise for both current and future robotic surgical systems.
