Feature-EndoGaussian: Feature Distilled Gaussian Splatting in Surgical Deformable Scene Reconstruction

Kai Li, Junhao Wang, William Han, Ding Zhao

TL;DR

FE-4DGS addresses the challenge of real-time reconstruction and semantic segmentation of deformable surgical scenes by distilling 2D semantic features into a 4D Gaussian Splatting framework. It introduces a Feature-Spatiotemporal deformation module to update per-Gaussian geometry and semantics, and uses differentiable rendering with a semantic alignment loss to fuse geometry with SAM-derived semantics. The approach achieves state-of-the-art rendering fidelity on EndoNeRF and SCARED while delivering real-time performance, and demonstrates strong binary segmentation and competitive multi-label segmentation on EndoVis18. This work enables unified reconstruction and segmentation in MIS, with implications for AR guidance and potential for language-guided editing in future systems.

Abstract

Minimally invasive surgery (MIS) requires high-fidelity, real-time visual feedback of dynamic and low-texture surgical scenes. To address these requirements, we introduce FeatureEndo-4DGS (FE-4DGS), the first real-time pipeline leveraging feature-distilled 4D Gaussian Splatting for simultaneous reconstruction and semantic segmentation of deformable surgical environments. Unlike prior feature-distilled methods restricted to static scenes, and existing 4D approaches that lack semantic integration, FE-4DGS leverages pre-trained 2D semantic embeddings to produce a unified 4D representation in which semantics also deform with tissue motion. This unified approach generates real-time RGB and semantic outputs through a single, parallelized rasterization process. Despite the additional complexity from feature distillation, FE-4DGS sustains real-time rendering (61 FPS) with a compact footprint, achieves state-of-the-art rendering fidelity on EndoNeRF (39.1 PSNR) and SCARED (27.3 PSNR), and delivers competitive EndoVis18 segmentation, matching or exceeding strong 2D baselines for binary segmentation tasks (0.93 DSC) and remaining competitive for multi-label segmentation (0.77 DSC).
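The abstract reports rendering fidelity as PSNR and segmentation quality as the Dice similarity coefficient (DSC). As a reminder of what those numbers measure, the following is a minimal sketch of both metrics using their standard definitions. It assumes pixel values scaled to [0, 1] and binary masks given as flat sequences; the function names `psnr` and `dice` are illustrative and are not taken from the paper's code, and the paper's exact evaluation protocol (for example, how multi-label DSC is averaged across classes) is not specified here.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def dice(pred_mask, true_mask):
    """Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|), for binary masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have equal length")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total
```

Higher is better for both: PSNR grows as the mean squared error shrinks, and DSC ranges from 0 (no overlap) to 1 (identical masks).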

Paper Structure

This paper contains 31 sections, 12 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Overview of FE-4DGS. The pipeline begins with holistic Gaussian initialization, re-projecting image pixels into 3D Gaussians. In the FST deformation module, a 4D voxel encoder extracts latent features, which a deformation decoder refines by updating Gaussian parameters and semantics via a lightweight MLP (Section \ref{sec:fst}, Appendix \ref{apd:fst}). A differentiable rasterizer renders the updated Gaussians into radiance and semantic maps, which a CNN decoder upsamples and aligns with features from a 2D segmentation model (SAM) to ensure semantic consistency.
  • Figure 2: Qualitative segmentation comparisons. Colors correspond to the following classes: kidney (blue), small intestine (orange), instrument shaft (red), instrument clasper (yellow), instrument wrist (purple), and clamps (green). Notably, FE-4DGS (ViT-H) and SAM (ViT-H) exhibit clean multi-label segmentations, while other models struggle with finer labels such as clamps or instrument claspers.
  • Figure 3: Comparison of qualitative renderings between EndoGaussian \cite{liu2024endogaussianrealtimegaussiansplatting}, FE-4DGS, and ground truth on the cutting (right) and pulling (left) sets of the EndoNeRF dataset \cite{wang2022neuralrenderingstereo3d}. Compared to previous methods, FE-4DGS reconstructs specular regions in finer detail.
  • Figure 4: Overview of the FST feature deformation module in FE-4DGS. It includes the HexPlanes \cite{cao2023hexplanefastrepresentationdynamic} and the decoder architectures required for updating the position, rotation, scale, opacity, and semantic features of the Gaussians.
  • Figure 5: Comparison of qualitative renderings between EndoGaussian \cite{liu2024endogaussianrealtimegaussiansplatting}, FE-4DGS, and ground truth on the EndoVis18 dataset \cite{allan20202018roboticscenesegmentation}. Across all scenes, FE-4DGS and EndoGaussian achieve similar rendering quality, but FE-4DGS additionally captures semantic features that can later be used for segmentation.
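The Figure 1 caption describes a rasterizer that renders radiance and semantic maps together and then aligns the semantic map with features from SAM. A minimal single-pixel sketch of that idea follows, using the standard front-to-back alpha compositing of Gaussian Splatting and applying the same blending weights to color and to a per-Gaussian feature vector. Everything here is an illustrative assumption rather than the paper's implementation: the names `composite` and `l1_alignment` are hypothetical, the Gaussians are assumed to be already depth-sorted with per-pixel opacities computed, and the L1 distance stands in for the semantic alignment loss, whose exact form is not given in this section.

```python
def composite(alphas, colors, features):
    """Front-to-back alpha blending of depth-sorted Gaussians for one pixel.

    alphas:   per-Gaussian opacity at this pixel, nearest first
    colors:   per-Gaussian RGB vectors
    features: per-Gaussian semantic feature vectors
    Returns the blended color, blended feature, and remaining transmittance.
    """
    if not alphas or not (len(alphas) == len(colors) == len(features)):
        raise ValueError("inputs must be non-empty and of equal length")
    transmittance = 1.0
    color = [0.0] * len(colors[0])
    feature = [0.0] * len(features[0])
    for a, c, f in zip(alphas, colors, features):
        weight = a * transmittance  # contribution of this Gaussian
        color = [acc + weight * ch for acc, ch in zip(color, c)]
        feature = [acc + weight * ch for acc, ch in zip(feature, f)]
        transmittance *= 1.0 - a  # light left for Gaussians behind
    return color, feature, transmittance


def l1_alignment(rendered, teacher):
    """Mean absolute difference between a rendered and a teacher feature vector."""
    if len(rendered) != len(teacher) or not rendered:
        raise ValueError("feature vectors must be non-empty and of equal length")
    return sum(abs(r - t) for r, t in zip(rendered, teacher)) / len(rendered)
```

Because color and feature share the same weights, a single pass over the sorted Gaussians yields both outputs, which is consistent with the abstract's description of RGB and semantic maps coming from one rasterization process.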