SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation

Kaiyuan Xu; Fangzhou Hong; Daniel Elson; Baoru Huang

TL;DR

This work proposes SurgCUT3R, a systematic framework that adapts unified 3D reconstruction models to the surgical domain and introduces a hybrid supervision strategy that couples the authors' pseudo-ground-truth with geometric self-correction to enhance robustness against inherent data imperfections.

Abstract

Reconstructing surgical scenes from monocular endoscopic video is critical for advancing robotic-assisted surgery. However, the application of state-of-the-art general-purpose reconstruction models is constrained by two key challenges: the lack of supervised training data and performance degradation over long video sequences. To overcome these limitations, we propose SurgCUT3R, a systematic framework that adapts unified 3D reconstruction models to the surgical domain. Our contributions are threefold. First, we develop a data generation pipeline that exploits public stereo surgical datasets to produce large-scale, metric-scale pseudo-ground-truth depth maps, bridging the data gap. Second, we propose a hybrid supervision strategy that couples our pseudo-ground-truth with geometric self-correction to enhance robustness against inherent data imperfections. Third, we introduce a hierarchical inference framework that employs two specialized models, one for global stability and one for local accuracy, to mitigate accumulated pose drift over long surgical videos. Experiments on the SCARED and StereoMIS datasets demonstrate that our method achieves a competitive balance between accuracy and efficiency, delivering pose estimation that is near state-of-the-art in accuracy yet substantially faster, and offering a practical and effective solution for robust reconstruction in surgical environments. Project page: https://chumo-xu.github.io/SurgCUT3R-ICRA26/.
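
The abstract's first contribution rests on turning stereo imagery into metric-scale depth. As a minimal sketch of the standard rectified-stereo relation (depth = focal length x baseline / disparity), and not the authors' actual pipeline, the conversion could look as follows. The function name `disparity_to_depth`, its parameters, and the `min_disp` validity threshold are hypothetical names chosen for illustration; the disparity map is assumed to come from some stereo matcher applied to a rectified pair.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=1e-6):
    """Convert a rectified-stereo disparity map (pixels) to metric depth (metres).

    Hypothetical helper for illustration only: focal_px is the focal length in
    pixels and baseline_m the stereo baseline in metres, both assumed known
    from the dataset's calibration.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.zeros_like(disparity)
    # Pixels with (near-)zero disparity have no finite depth; mark them invalid.
    valid = disparity > min_disp
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth, valid
```

Because the baseline is expressed in metres, the resulting depth is metric rather than scale-ambiguous, which is what makes stereo data usable as supervision for a monocular model. The returned mask lets a training loss skip pixels where the disparity was invalid.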

Paper Structure (29 sections, 7 equations, 5 figures, 3 tables)

INTRODUCTION
Related Work
Learning-Based Dense Correspondence for Reconstruction
SLAM-based Methods for Long-Sequence Consistency
Reconstruction for Surgical Scenes
Methodology
Preliminaries: CUT3R
Proposed Method: SurgCUT3R
Pseudo-GT Generation for Surgical Scenes
Stereo Preprocessing and Rectification.
Metric-Scale Depth Synthesis and Dataset Assembly.
Hybrid Supervision for Robust Training
Supervised Terms ($\mathcal{L}_{\text{conf}}$ and $\mathcal{L}_{\text{pose}}$).
Self-Supervised Term ($\mathcal{L}_{\text{consistency}}$).
Hierarchical Framework for Long-Sequence Inference
...and 14 more sections

Figures (5)

Figure 1: Qualitative results of 3D reconstruction. With videos (small images) as input, this figure shows the reconstruction from the first frame (large images left) and the accumulated 3D model from multiple frames (large images right). This alignment between the single-frame and multi-frame reconstruction results highlights the geometric consistency of our method.
Figure 2: Overview of SurgCUT3R. Left: The unified reconstruction pipeline. Streaming video frames are encoded via a ViT encoder and interact with a persistent state, which is continuously updated to sequentially output the pointmap and camera parameters for each frame. Right: Our hierarchical framework for long-sequence inference. The pink lines represent camera trajectories. A sparse but globally stable trajectory from a global model ($M_{global}$) provides anchor points to correct and stitch the dense but locally drifting trajectories from a local model ($M_{local}$), producing a final, drift-corrected trajectory.
Figure 3: Pipeline of pseudo-GT depth generation. The process rectifies the raw stereo pair before feeding it into FoundationStereo to generate a geometrically correct, metric-scale depth map.
Figure 4: Qualitative results of monocular depth estimation. We compare our method with MonST3R, Spann3R, AF-SfMLearner, EndoDAC and MegaSaM on the SCARED and StereoMIS datasets. Our method achieves the best qualitative results among feed-forward methods.
Figure 5: Qualitative comparison of camera trajectories. Left: Without the hierarchical inference framework. Right: With our hierarchical inference framework (Ours).
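
The anchor-based correction described in the Figure 2 caption can be illustrated with a simplified sketch. This is an assumption-laden toy version, not the paper's procedure: it handles camera positions only (no rotations), and it spreads the mismatch between the dense local trajectory and the sparse anchors linearly across each segment. The function `stitch_with_anchors` and its arguments are hypothetical names introduced here.

```python
import numpy as np

def stitch_with_anchors(local_pos, anchor_idx, anchor_pos):
    """Pull a dense, drifting trajectory onto sparse, trusted anchor positions.

    local_pos:  (N, 3) positions from the dense local estimate.
    anchor_idx: sorted frame indices at which anchors are available.
    anchor_pos: (K, 3) anchor positions for those frames.
    Illustrative assumption: the error between consecutive anchors is
    interpolated linearly; frames outside the anchored range are unchanged.
    """
    local_pos = np.asarray(local_pos, dtype=np.float64)
    anchor_pos = np.asarray(anchor_pos, dtype=np.float64)
    corrected = local_pos.copy()
    for k in range(len(anchor_idx) - 1):
        s, e = anchor_idx[k], anchor_idx[k + 1]
        seg = local_pos[s:e + 1]
        # Mismatch between the anchors and the local estimate at both ends.
        err_start = anchor_pos[k] - seg[0]
        err_end = anchor_pos[k + 1] - seg[-1]
        # Blend weights run from 0 at the segment start to 1 at its end.
        w = np.linspace(0.0, 1.0, e - s + 1)[:, None]
        corrected[s:e + 1] = seg + (1.0 - w) * err_start + w * err_end
    return corrected
```

After correction the trajectory passes exactly through every anchor while the local estimate still determines the shape of the path between anchors. A full system would also need to correct orientations and might use a rigid or similarity alignment per segment instead of linear blending.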
