4D Monocular Surgical Reconstruction under Arbitrary Camera Motions

Jiwei Shan, Zeyu Cai, Cheng-Tai Hsieh, Yirui Li, Hao Liu, Lijun Han, Hesheng Wang, Shing Shin Cheng

TL;DR

Local-EndoGS is a scalable 4D reconstruction framework for monocular endoscopy under arbitrary camera motion. It decomposes long sequences into adaptive local windows and models each window with a local 3D Gaussian Splatting representation plus a deformation field. A coarse-to-fine initialization combines multi-view geometry, cross-window propagation, and monocular depth priors to initialize local canonical spaces without stereo depth or COLMAP, while long-range 2D pixel trajectories and physics-based regularization enforce plausible tissue motion. On EndoNeRF, StereoMIS, and EndoMapper, the method achieves better appearance and geometric reconstruction than state-of-the-art methods, and ablations confirm the contribution of the progressive window-based global scene representation (PWGSR), local canonical space initialization (LCSI), TAP-based correspondences, and region refinement. The approach offers a practical path toward high-quality 4D reconstruction from monocular endoscopy for surgical planning and education. Its stated limitations are the multi-view consistency of Gaussian representations and offline processing. Future work includes real-time deployment, parallelized training over multiple windows, and enhanced correspondences to better handle topological changes.
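To make the idea of adaptive local windows concrete, here is a minimal sketch of one way a frame sequence could be partitioned. It is an illustration under stated assumptions, not the authors' criterion: the per-frame `motion` score, the `max_motion` and `max_len` budgets, and the names `Window`, `split_into_windows`, and `window_for_frame` are all hypothetical, and this summary does not specify how Local-EndoGS actually measures the "dynamic characteristics" that drive its split.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Window:
    start: int  # index of the first frame in the window (inclusive)
    end: int    # index of the last frame in the window (inclusive)


def split_into_windows(motion: Sequence[float],
                       max_motion: float,
                       max_len: int) -> List[Window]:
    """Greedily group consecutive frames into windows.

    motion[i] is a hypothetical non-negative score for the change between
    frame i-1 and frame i (motion[0] is ignored). A window is closed as soon
    as adding the next frame would exceed the motion budget or the length cap.
    """
    if not motion:
        return []
    windows: List[Window] = []
    start = 0
    accumulated = 0.0
    for i in range(1, len(motion)):
        next_accumulated = accumulated + motion[i]
        next_length = i - start + 1
        if next_accumulated > max_motion or next_length > max_len:
            # Close the current window before frame i and start a new one.
            # The motion that triggered the split is not carried over.
            windows.append(Window(start, i - 1))
            start = i
            accumulated = 0.0
        else:
            accumulated = next_accumulated
    windows.append(Window(start, len(motion) - 1))
    return windows


def window_for_frame(windows: Sequence[Window], frame: int) -> int:
    """Return the index of the window (and thus local model) covering a frame."""
    for idx, w in enumerate(windows):
        if w.start <= frame <= w.end:
            return idx
    raise IndexError(f"frame {frame} is not covered by any window")
```

In this sketch, each `Window` would own one local deformable model, and `window_for_frame` selects which local model to query when rendering a given frame. Fast camera motion yields short windows and slow motion yields long ones, capped by `max_len`.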

Abstract

Reconstructing deformable surgical scenes from endoscopic videos is challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes with fixed endoscope viewpoints and rely on stereo depth priors or accurate structure-from-motion for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates local deformable scene models to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization without stereo depth or accurate structure-from-motion, we design a coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate the effectiveness of our key designs. Code will be released upon acceptance at: https://github.com/IRMVLab/Local-EndoGS.
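Monocular depth predictions are often only defined up to an unknown scale and shift, so a common generic way to combine them with sparse multi-view geometry is a least-squares affine alignment. The sketch below shows that standard technique only. Whether Local-EndoGS uses this exact step is not stated in this summary, and the function name `fit_scale_shift` and its inputs are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def fit_scale_shift(mono: Sequence[float],
                    sparse: Sequence[float]) -> Tuple[float, float]:
    """Fit scale s and shift t minimizing sum((s * mono[i] + t - sparse[i])**2).

    mono:   monocular depth values sampled at pixels with known sparse depth
            (hypothetical input, e.g. from any monocular depth network).
    sparse: metric or up-to-scale depths at the same pixels from multi-view
            geometry (hypothetical input, e.g. triangulated correspondences).
    """
    n = len(mono)
    if n != len(sparse) or n < 2:
        raise ValueError("need at least two paired depth samples")
    mean_m = sum(mono) / n
    mean_s = sum(sparse) / n
    var_m = sum((m - mean_m) ** 2 for m in mono)
    if var_m == 0.0:
        raise ValueError("monocular depths are constant; scale is undefined")
    cov = sum((m - mean_m) * (s - mean_s) for m, s in zip(mono, sparse))
    scale = cov / var_m               # closed-form slope of the linear fit
    shift = mean_s - scale * mean_m   # closed-form intercept
    return scale, shift


def align_depth(mono_depth: Sequence[float], scale: float, shift: float):
    """Apply the fitted affine map to a full list of monocular depth values."""
    return [scale * d + shift for d in mono_depth]
```

Once aligned, the dense monocular depth lives in the same frame as the sparse geometry and could be back-projected to seed a denser point set. A robust variant (for example, fitting on inliers only) would be preferable when the sparse depths contain outliers.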

Paper Structure

This paper contains 29 sections, 19 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: (a)–(c) Illustrations of three typical types of camera motion in surgical scenes: (a) fixed camera, (b) camera moving around the tissue, and (c) camera moving forward. (d) Monocular reconstruction results of different methods under camera motion. State-of-the-art 4D surgical reconstruction algorithms experience significant degradation in reconstruction quality when the camera moves, while our method maintains superior performance.
  • Figure 2: Overview of Local-EndoGS. Given a long monocular endoscopic sequence with arbitrary camera motion, Local-EndoGS reconstructs the entire deformable scene using a progressive window-based global scene representation (\ref{sec:Progressive}). Specifically, the sequence is first divided into multiple local windows based on its dynamic characteristics. For each local window, the scene structure is initialized using a local canonical space initialization strategy (\ref{sec:init}). Then, a local deformable scene representation (\ref{sec:local}) is used to model each region. The parameters of each local scene representation are then optimized using carefully designed loss functions (\ref{sec:loss}) in a progressive manner until all local models are fully optimized.
  • Figure 3: Comparison of feature matching and point cloud results using traditional methods and the Track-Any-Point (TAP) model \cite{chen2024leap}. (a) Correspondences obtained using SIFT keypoints with brute-force matching. (b) Sparse point cloud reconstructed from the correspondences shown in (a). (c) Correspondences from the TAP model. (d) Dense point cloud from TAP-based correspondences. Green and red lines indicate correct and incorrect feature matches, respectively.
  • Figure 4: Visualization of RGB images rendered by 3DGS $\phi^C_i$ initialized from the coarse stage and their corresponding reconstruction error maps with respect to the ground truth. Top row: rendered RGB images. Bottom row: pixel-wise reconstruction error maps, where higher values indicate greater reconstruction errors, especially near tissue boundaries and regions with deformation.
  • Figure 5: Qualitative comparison of image rendering and depth prediction on deformable scenes at different time points from the StereoMIS dataset. Each pair of rows represents a specific time point: the first row shows the rendered RGB images, and the second row shows the predicted depth maps. The figure presents results from three time points (from top to bottom), illustrating how the observed scene changes as the camera moves. Compared to existing methods, Local-EndoGS (Ours) consistently provides finer reconstruction details, while the baseline methods show limitations in both image quality and depth accuracy.
  • ...and 4 more figures