EndoPerfect: High-Accuracy Monocular Depth Estimation and 3D Reconstruction for Endoscopic Surgery via NeRF-Stereo Fusion

Pengcheng Chen; Wenhao Li; Nicole Gunderson; Jeremy Ruthberg; Randall Bly; Zhenglong Sun; Waleed M. Abuzeid; Eric J. Seibel

TL;DR

EndoPerfect targets fast, radiation-free, submillimeter monocular depth estimation for endoscopic sinus surgery (ESS). Its iterative pipeline uses NeRF as an intermediate representation, generates optimized novel stereo views, and applies depth-supervised refinement to produce dense 3D reconstructions. The method achieves point-to-point accuracy below $0.5$ mm and depth accuracy of $0.125 \pm 0.443$ mm, validated on synthetic, phantom, cadaver, and intraoperative data, and it runs faster than intraoperative CT (iCT) for 100-frame sequences. Key contributions include a Nerfacto-based NeRF initialization, a gradient-driven novel view optimization that preserves epipolar geometry, and a DS-NeRF-inspired depth supervision loop with progressive baseline growth and geometric fusion. The results suggest EndoPerfect can serve as a practical iCT replacement in ESS, offering submillimeter accuracy with reduced radiation exposure and improved intraoperative efficiency. Future work targets generalization to larger spaces and near-real-time processing.
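
For context on why a progressively growing stereo baseline can matter, the textbook relation for rectified stereo is a useful reference. It is not taken from the paper, and the symbols are assumptions introduced here: $f$ is the focal length in pixels, $B$ the baseline between the two rendered views, $d$ the disparity in pixels, and $Z$ the depth.

$$
Z = \frac{f B}{d}, \qquad |\delta Z| \approx \frac{Z^{2}}{f B}\,|\delta d|
$$

Under this model, a fixed disparity error $|\delta d|$ produces a depth error that shrinks in proportion to $1/B$, so doubling the baseline roughly halves the depth error at a given depth, provided stereo matching remains reliable at the wider baseline.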

Abstract

In endoscopic sinus surgery (ESS), intraoperative CT (iCT) offers valuable intraoperative assessment but is constrained by slow deployment and radiation exposure, limiting its clinical utility. Endoscope-based monocular 3D reconstruction is a promising alternative; however, existing techniques often struggle to achieve the submillimeter precision required for dense reconstruction. In this work, we propose an iterative online learning approach that leverages Neural Radiance Fields (NeRF) as an intermediate representation, enabling monocular depth estimation and 3D reconstruction without relying on prior medical data. Our method attains a point-to-point accuracy below $0.5$ mm, with a demonstrated theoretical depth accuracy of $0.125 \pm 0.443$ mm. We validate our approach across synthetic, phantom, and real endoscopic scenarios, confirming its accuracy and reliability. These results underscore the potential of our pipeline as an iCT alternative, meeting the demanding submillimeter accuracy standards required in ESS.

Paper Structure (12 sections, 4 equations, 3 figures, 3 tables)

Introduction
Method
NeRF initialization
Novel Stereo Views Generation
Iterative Refinement
Experiment
Virtual Endoscopy Experiments
Phantom Experiments
Cadaver Experiments
Clinical intraoperative data analysis
Ablation study
Conclusion

Figures (3)

Figure 1: Pipeline overview: First, an initial scene is reconstructed using NeRF. Next, optimized novel stereo views are generated within this scene for stereo depth estimation. The resulting depth maps are then used to supervise the subsequent NeRF training round. This iterative process continues until the depths converge, after which the final depth maps are fused for 3D reconstruction.
Figure 2: Comprehensive validation of our method across multiple experiments: a clinical experiment validating its potential to substitute for iCT; a 3D reconstruction vs. CT comparison showing near-perfect alignment along anatomical boundaries; and case demonstrations from endoscopic procedures showing high structural fidelity between the depth estimates and the RGB inputs.
Figure 3: (A) Registration of the cadaver CT scan with our 3D reconstruction. (B) Results on in vivo data, demonstrating that blood and mucus on the surface do not significantly affect reconstruction quality.
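
The loop described in the Figure 1 caption (train NeRF, render a stereo pair, estimate depth, feed the depth back as supervision, repeat until convergence) can be sketched as below. This is a minimal illustration and not the authors' implementation: every function name, the baseline growth factor, the convergence tolerance, and the toy stand-ins for NeRF training and stereo matching are assumptions made for this example. The final fusion step into a 3D reconstruction is omitted.

```python
from typing import Any, Callable, List, Optional, Tuple

import numpy as np


def iterative_refinement(
    train_nerf: Callable[[Optional[np.ndarray]], Any],
    render_stereo_pair: Callable[[Any, float], Tuple[Any, Any]],
    stereo_depth: Callable[[Tuple[Any, Any], float], np.ndarray],
    baseline_mm: float = 1.0,       # hypothetical starting baseline
    baseline_growth: float = 1.5,   # hypothetical growth factor per round
    tol_mm: float = 0.05,           # hypothetical convergence tolerance
    max_rounds: int = 10,
) -> List[np.ndarray]:
    """Run train -> render -> stereo depth -> supervise until depths converge.

    Returns the depth map produced in each round; the last entry is the
    converged estimate (or the estimate at max_rounds).
    """
    depth_prior: Optional[np.ndarray] = None
    history: List[np.ndarray] = []
    for _ in range(max_rounds):
        # Round 0 trains without depth supervision; later rounds use the prior.
        model = train_nerf(depth_prior)
        pair = render_stereo_pair(model, baseline_mm)
        depth = stereo_depth(pair, baseline_mm)
        history.append(depth)
        # Stop once the mean absolute change between rounds is small.
        if depth_prior is not None:
            if float(np.mean(np.abs(depth - depth_prior))) < tol_mm:
                break
        depth_prior = depth
        baseline_mm *= baseline_growth  # progressively widen the baseline
    return history


# --- Toy stand-ins so the sketch runs end to end (not real NeRF or stereo) ---
TRUE_DEPTH = np.full((4, 4), 20.0)  # synthetic ground-truth depth in mm


def toy_train_nerf(depth_prior: Optional[np.ndarray]) -> np.ndarray:
    # The "model" is just the current depth belief; start 5 mm off.
    return np.full((4, 4), 25.0) if depth_prior is None else depth_prior


def toy_render_stereo_pair(model: np.ndarray, baseline_mm: float):
    # A real pipeline would render two RGB views separated by baseline_mm.
    return (model, model)


def toy_stereo_depth(pair, baseline_mm: float) -> np.ndarray:
    # Pretend stereo matching halves the remaining error each round.
    left, _ = pair
    return left + 0.5 * (TRUE_DEPTH - left)


if __name__ == "__main__":
    depths = iterative_refinement(
        toy_train_nerf, toy_render_stereo_pair, toy_stereo_depth
    )
    final_error = float(np.abs(depths[-1] - TRUE_DEPTH).mean())
    print(f"rounds: {len(depths)}, final mean error: {final_error:.4f} mm")
```

Passing the three stages in as callables keeps the control flow (supervision feedback, baseline growth, convergence check) separate from whichever NeRF trainer and stereo matcher are actually used.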
