EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

Falong Fan, Yi Xie, Arnis Lektauers, Bo Liu, Jerzy Rozenblit

Abstract

Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.
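
The contrast between static spatial neighborhoods and feature-space graphs can be illustrated with a small sketch. It is not the authors' implementation: the similarity measure (cosine), the helper names `cosine` and `feature_knn`, and the toy feature vectors are all assumptions made for illustration. It only shows how a K-nearest-neighbor graph built in feature space can link tissue patches that are not adjacent on the image grid.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def feature_knn(features, k):
    """Hypothetical helper: for each token i, return the indices of its k
    most similar tokens in feature space, excluding i itself."""
    neighbors = []
    for i, fi in enumerate(features):
        scored = [(cosine(fi, fj), j) for j, fj in enumerate(features) if j != i]
        # Highest similarity first; ties are broken by the lower index.
        scored.sort(key=lambda t: (-t[0], t[1]))
        neighbors.append([j for _, j in scored[:k]])
    return neighbors

# Toy tokens laid out left to right. Tokens 0 and 3 are separated on the grid
# by two instrument-like tokens but have similar (made-up) features.
features = [
    [1.0, 0.0],  # tissue-like patch, left of the occluder
    [0.0, 1.0],  # instrument-like patch
    [0.1, 1.0],  # instrument-like patch
    [0.9, 0.1],  # tissue-like patch, right of the occluder
]
graph = feature_knn(features, k=1)
print(graph)
```

In this toy case token 0 selects token 3 as its neighbor even though the two are not spatially adjacent, which is the kind of long-range link a fixed grid neighborhood would miss.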

Paper Structure

This paper contains 56 sections, 7 theorems, 37 equations, 4 figures, 4 tables.

Key Result

lemma 1

For each node $i$, the attention coefficients $\{\alpha_{ij}\}_{j\in\mathcal{N}(i)}$ defined in Eq. eq:degat_agg satisfy: (i) $\alpha_{ij}\ge 0$ for all $j\in\mathcal{N}(i)$; and (ii) $\sum_{j\in\mathcal{N}(i)} \alpha_{ij} = 1$. Equivalently, the induced sparse matrix $\tilde{\mathbf{A}}$ in Eq. eq:
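
The two properties in the lemma can be checked numerically. The referenced equation is not reproduced here, so the sketch below assumes the coefficients come from a softmax over each node's neighbor scores; the `softmax` helper and the score values are illustrative assumptions, not the paper's definitions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one node's neighbor scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw attention scores: one list per node i, one entry per neighbor j.
scores = {
    0: [2.0, -1.0, 0.5],
    1: [0.0, 0.0],
    2: [-3.0, 4.0, 1.0, 1.0],
}

alpha = {i: softmax(s) for i, s in scores.items()}
row_sums = {i: sum(row) for i, row in alpha.items()}

for i, row in alpha.items():
    # (i) every coefficient is non-negative; (ii) each row sums to one.
    print(i, all(a >= 0.0 for a in row), abs(row_sums[i] - 1.0) < 1e-12)
```

Under this assumption both properties hold by construction: exponentials are positive, and dividing by their sum normalizes each row.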

Figures (4)

  • Figure 1: Visualization of DeGAT neighbor aggregation. (a–b) Visualization of neighborhood construction and feature responses in the proposed DeGAT module. $\bigstar$ indicates the centroid and $\circ$ indicates its neighbors. The highlighted $\bigstar$ aggregates informative context even across instrument boundaries, enabling robust feature refinement. (c–d) Depth estimation comparison without (c) and with (d) DeGAT. Incorporating DeGAT yields sharper boundaries and improved structural continuity for both instruments and organs, as shown in the red box.
  • Figure 2: Overview of the EndoVGGT framework. The proposed DeGAT module enhances the features extracted from DINOv2 [oquab2023dinov2], and camera tokens interact via both global and within-frame attention mechanisms. To reconstruct the input scene, depth maps are predicted by a DPT head [ranftl2021vision] and camera poses by an MLP; these predictions are constrained by the composite loss introduced in Sec. \ref{sec:training_setup}.
  • Figure 3: Experimental results on the EndoNeRF and SCARED datasets. "Average" denotes the mean performance across all evaluated subsets.
  • Figure 5: Ablation study on the number of neighbors $K$ on the SCARED dataset.

Theorems & Definitions (14)

  • lemma 1: Row-stochasticity of DeGAT attention
  • proof
  • corollary 1: Convex-hull property of the aggregated message
  • proof
  • corollary 2: Coordinate-wise bounds (min–max property)
  • proof
  • proposition 1: Norm bound of one-hop DeGAT aggregation
  • proof
  • proposition 2: Permutation equivariance of one-hop DeGAT
  • proof
  • ...and 4 more
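
The result titles above suggest properties that follow when each aggregated message is a convex combination of neighbor values. The sketch below checks three of them numerically under that assumption: coordinate-wise min–max bounds, a norm bound by the largest neighbor norm, and permutation equivariance. The exact statements are not reproduced here, so the `aggregate` function, the score matrix, and the specific bounds tested are illustrative assumptions rather than the paper's definitions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(values, neighbors, scores):
    """Assumed one-hop aggregation: out[i] = sum_j alpha_ij * values[j],
    with alpha_i the softmax of scores[i][j] over j in neighbors[i]."""
    dim = len(values[0])
    out = []
    for i, nbrs in enumerate(neighbors):
        alpha = softmax([scores[i][j] for j in nbrs])
        out.append([sum(a * values[j][d] for a, j in zip(alpha, nbrs))
                    for d in range(dim)])
    return out

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Made-up node values, neighbor lists, and pairwise scores.
values = [[1.0, -2.0], [0.0, 3.0], [4.0, 0.5], [-1.0, 1.0]]
neighbors = [[1, 2], [0, 3], [0, 1, 3], [2, 1]]
scores = [[0.0, 1.2, -0.3, 0.0],
          [0.7, 0.0, 0.0, -1.1],
          [0.2, 0.9, 0.0, 0.4],
          [0.0, -0.5, 2.0, 0.0]]
out = aggregate(values, neighbors, scores)

eps = 1e-12
# Coordinate-wise bounds: each output coordinate lies between the min and max
# of that coordinate over the node's neighbors.
minmax_ok = all(
    min(values[j][d] for j in nbrs) - eps <= out[i][d]
    <= max(values[j][d] for j in nbrs) + eps
    for i, nbrs in enumerate(neighbors) for d in range(2))
# Norm bound: the output norm never exceeds the largest neighbor norm.
norm_ok = all(norm(out[i]) <= max(norm(values[j]) for j in nbrs) + eps
              for i, nbrs in enumerate(neighbors))

# Permutation equivariance: relabel node i as p[i], rerun, compare outputs.
p = [2, 0, 3, 1]
n = len(values)
values_p = [None] * n
neighbors_p = [None] * n
scores_p = [[0.0] * n for _ in range(n)]
for i in range(n):
    values_p[p[i]] = values[i]
    neighbors_p[p[i]] = [p[j] for j in neighbors[i]]
    for j in range(n):
        scores_p[p[i]][p[j]] = scores[i][j]
out_p = aggregate(values_p, neighbors_p, scores_p)
equivariant = all(abs(out_p[p[i]][d] - out[i][d]) < 1e-12
                  for i in range(n) for d in range(2))
print(minmax_ok, norm_ok, equivariant)
```

The norm check relies only on the triangle inequality applied to a convex combination, so it would hold for any non-negative weights that sum to one, not just softmax weights.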