4D Monocular Surgical Reconstruction under Arbitrary Camera Motions
Jiwei Shan, Zeyu Cai, Cheng-Tai Hsieh, Yirui Li, Hao Liu, Lijun Han, Hesheng Wang, Shing Shin Cheng
TL;DR
Local-EndoGS is a scalable 4D reconstruction framework for monocular endoscopy under arbitrary camera motion. It decomposes long sequences into adaptive local windows and models each window with a local 3D Gaussian Splatting representation plus a deformation field. A coarse-to-fine initialization combines multi-view geometry, cross-window propagation, and monocular depth priors to initialize local canonical spaces without stereo depth or COLMAP, while long-range 2D pixel trajectories and physics-based regularization enforce plausible tissue motion. Experiments on EndoNeRF, StereoMIS, and EndoMapper show better appearance and geometric reconstruction than state-of-the-art methods, with ablations confirming the contribution of PWGSR, LCSI, TAP-based correspondences, and region refinement. The approach offers a practical path toward high-quality 4D reconstruction from monocular endoscopy for surgical planning and education. Its stated limitations are the multi-view consistency of Gaussian representations and offline processing. Future work includes real-time deployment, parallelized training over multiple windows, and enhanced correspondences to better handle topological changes.
Abstract
Reconstructing deformable surgical scenes from endoscopic videos is challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes with fixed endoscope viewpoints and rely on stereo depth priors or accurate structure-from-motion for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates local deformable scene models to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization without stereo depth or accurate structure-from-motion, we design a coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate the effectiveness of our key designs. Code will be released upon acceptance at: https://github.com/IRMVLab/Local-EndoGS.
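The window-based decomposition described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the splitting criterion, so the rule used here (start a new window once accumulated camera motion passes a threshold or the window reaches a maximum length) is an assumption. The function name `split_into_windows`, its parameters, and the non-overlapping windows are likewise hypothetical simplifications.

```python
from typing import List, Sequence, Tuple


def split_into_windows(
    frame_motion: Sequence[float],
    max_motion: float,
    max_len: int,
) -> List[Tuple[int, int]]:
    """Greedily partition frames into half-open windows [start, end).

    frame_motion[i] is an assumed scalar measure of camera motion from
    frame i-1 to frame i (frame_motion[0] is ignored). Both the measure
    and the thresholds are illustrative, not taken from the paper.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")

    windows: List[Tuple[int, int]] = []
    start = 0
    accumulated = 0.0
    for i in range(1, len(frame_motion)):
        accumulated += frame_motion[i]
        # Close the current window when the camera has moved too far
        # or the window has grown too long; frame i opens the next one.
        if accumulated > max_motion or i - start >= max_len:
            windows.append((start, i))
            start = i
            accumulated = 0.0
    if len(frame_motion) > 0:
        windows.append((start, len(frame_motion)))
    return windows


if __name__ == "__main__":
    motion = [0.0, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1]
    print(split_into_windows(motion, max_motion=0.5, max_len=3))
```

In a pipeline of this shape, each returned window would be given its own local canonical space and deformation field, so the cost of modeling a long sequence grows with the number of windows rather than requiring one global deformable model.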
