Endo-4DGS: Endoscopic Monocular Scene Reconstruction with 4D Gaussian Splatting

Yiming Huang, Beilei Cui, Long Bai, Ziqi Guo, Mengya Xu, Mobarakol Islam, Hongliang Ren

TL;DR

Endo-4DGS tackles real-time reconstruction of deformable endoscopic scenes where monocular depth priors are noisy or unavailable. It introduces 4D Gaussian Splatting, combining a static 3D Gaussian with learned temporal deformation via a spatial-temporal encoder and a deformation network, initialized from Depth-Anything-based depth priors. The approach incorporates confidence-guided learning, depth regularization, and surface normal constraints to robustly leverage pseudo-depth in monocular endoscopy, enabling fast rendering and improved geometry. Empirical results on StereoMIS and EndoNeRF datasets show real-time performance (up to 100 FPS) with competitive or superior reconstruction quality and reduced training requirements, highlighting its practicality for robot-assisted surgery.

Abstract

In the realm of robot-assisted minimally invasive surgery, dynamic scene reconstruction can significantly enhance downstream tasks and improve surgical outcomes. Neural Radiance Fields (NeRF)-based methods have recently risen to prominence for their exceptional ability to reconstruct scenes but are hampered by slow inference speed, prolonged training, and inconsistent depth estimation. Some previous works rely on ground-truth depth for optimization, but such depth is hard to acquire in the surgical domain. To overcome these obstacles, we present Endo-4DGS, a real-time endoscopic dynamic reconstruction approach that utilizes 3D Gaussian Splatting (GS) for 3D representation. Specifically, we propose lightweight MLPs to capture temporal dynamics with Gaussian deformation fields. To obtain a satisfactory Gaussian initialization, we exploit a powerful depth estimation foundation model, Depth-Anything, to generate pseudo-depth maps as a geometry prior. We additionally propose confidence-guided learning to tackle the ill-posed problem of monocular depth estimation and enhance the depth-guided reconstruction with surface normal constraints and depth regularization. Our approach has been validated on two surgical datasets, where it can effectively render in real time, compute efficiently, and reconstruct with remarkable accuracy.
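
The abstract states that pseudo-depth maps serve as a geometry prior for Gaussian initialization, but does not spell out how. One common way to turn a depth map into initial 3D positions is pinhole back-projection, sketched below. This is only an illustration under stated assumptions, not the authors' implementation: the function name `backproject_depth`, the intrinsics `fx, fy, cx, cy`, and the optional `mask` argument (e.g. to exclude surgical tools) are all hypothetical, and the input is assumed to already be a depth-like map, whereas relative predictions from a model such as Depth-Anything would first need a conversion step that this excerpt does not describe.

```python
import numpy as np


def backproject_depth(depth, fx, fy, cx, cy, mask=None):
    """Lift an (H, W) depth map to an (N, 3) camera-frame point cloud.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy). All names here are illustrative assumptions.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Keep only pixels with positive depth, optionally restricted by a mask.
    valid = depth > 0
    if mask is not None:
        valid = valid & mask.astype(bool)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```

The returned points could then seed the Gaussian means, with the remaining attributes (rotation, scaling, opacity, spherical harmonics) initialized separately.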

Paper Structure (9 sections, 9 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Ground truth reference, estimated depth from Depth-Anything; 3D textures, rendered image, and predicted depth of our proposed method.
  • Figure 2: Illustration of our proposed Endo-4DGS framework. We utilize monocular images, estimated depths from Depth-Anything, and surgical tool masks for training. A 3D Gaussian is represented as $\mathcal{G}$ with position mean $\mu$, rotation $\mathbf{R}$, scaling $\mathbf{S}$, opacity $o$, and spherical harmonics $\mathbf{SH}$. The 4D Gaussian is described as $\mathcal{G}^\prime=\mathcal{G}+\Delta\mathcal{G}$. $\mathcal{L}_{color}, \mathcal{L}_{con}, \mathcal{L}_{depth}, \mathcal{L}_{surf}, \mathcal{L}_{tv}$ are the color loss, confidence loss, depth regularization loss, surface normal loss, and total variation loss, respectively.
  • Figure 3: Qualitative comparison on the EndoNeRF dataset (wang2022neural) against EndoNeRF (wang2022neural), EndoSurf (zha2023endosurf), and LerPlane (yang2023neural).
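
The relation $\mathcal{G}^\prime=\mathcal{G}+\Delta\mathcal{G}$ from the Figure 2 caption can be illustrated with a minimal sketch. The container below mirrors the attributes listed in the caption; everything else is an assumption made for illustration: the array shapes, the names `Gaussians`, `apply_deformation`, and `toy_delta`, and in particular `toy_delta` itself, which is a hand-written stand-in for the learned spatial-temporal encoder and deformation network and does not reflect which attributes the actual model deforms.

```python
from dataclasses import dataclass, fields

import numpy as np


@dataclass
class Gaussians:
    mean: np.ndarray      # (N, 3) position mean, mu
    rotation: np.ndarray  # (N, 4) rotation as a quaternion, R (assumed layout)
    scale: np.ndarray     # (N, 3) scaling, S
    opacity: np.ndarray   # (N, 1) opacity, o
    sh: np.ndarray        # (N, K, 3) spherical harmonics coefficients, SH


def apply_deformation(g, delta):
    """Return G' = G + delta, summed attribute by attribute."""
    return Gaussians(**{
        f.name: getattr(g, f.name) + getattr(delta, f.name)
        for f in fields(Gaussians)
    })


def toy_delta(g, t):
    """Hypothetical stand-in for the deformation network.

    Shifts every mean along z by sin(t) and leaves all other
    attributes unchanged (zero offsets).
    """
    offsets = {f.name: np.zeros_like(getattr(g, f.name)) for f in fields(Gaussians)}
    offsets["mean"][:, 2] = np.sin(t)
    return Gaussians(**offsets)
```

In this sketch the static Gaussians stay untouched and a new deformed set is produced for each timestamp `t`, which matches the caption's additive formulation; a real pipeline would replace `toy_delta` with a learned network queried at the Gaussian positions and time.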