SegSplat: Feed-forward Gaussian Splatting and Open-Set Semantic Segmentation

Peter Siegel, Federico Tombari, Marc Pollefeys, Daniel Barath

TL;DR

The experiments demonstrate that SegSplat achieves geometric fidelity comparable to state-of-the-art feed-forward 3D Gaussian Splatting methods while simultaneously enabling robust open-set semantic segmentation, crucially without any per-scene optimization for semantic feature integration.

Abstract

We introduce SegSplat, a novel framework designed to bridge the gap between rapid, feed-forward 3D reconstruction and rich, open-vocabulary semantic understanding. By constructing a compact semantic memory bank from multi-view 2D foundation model features and predicting discrete semantic indices alongside geometric and appearance attributes for each 3D Gaussian in a single pass, SegSplat efficiently imbues scenes with queryable semantics. Our experiments demonstrate that SegSplat achieves geometric fidelity comparable to state-of-the-art feed-forward 3D Gaussian Splatting methods while simultaneously enabling robust open-set semantic segmentation, crucially without requiring any per-scene optimization for semantic feature integration. This work represents a significant step towards practical, on-the-fly generation of semantically aware 3D environments, vital for advancing robotic interaction, augmented reality, and other intelligent systems.

Paper Structure

This paper contains 11 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Visualization of learned 3D language features of the previous state-of-the-art method, LangSplat [qin2024langsplat], and our SegSplat. While LangSplat requires per-scene training and generates imprecise features, SegSplat captures smooth regions more consistently and needs no training. SegSplat is also 59× faster than LangSplat.
  • Figure 2: SegSplat predicts 3D Gaussian splats embedded with language features from sparse multi-view images, without any training. Our pipeline leverages pretrained DepthSplat to estimate 3D Gaussian parameters per pixel, and uses SAM+CLIP to extract segmentation masks and CLIP embeddings. To ensure memory efficiency, we construct a CLIP feature memory bank and represent per-object semantics using one-hot index maps aligned with this bank. These semantic indices are appended to the Gaussians predicted by DepthSplat. After splatting, we reconstruct full-length language features via an element-wise product between the rendered index maps and the memory bank. Novel-view querying is then performed on the decoded CLIP feature image.
  • Figure 3: Comparison of predicted and ground truth (GT) color and semantic maps for novel views rendered by SegSplat on the RealEstate10K dataset [realestate10k]. Semantic maps are visualized using PCA. Ground truth semantic features are obtained by applying SAM+CLIP to the corresponding GT novel view images. Each group of four columns shows: GT RGB image, SegSplat-rendered RGB, GT semantics, and SegSplat-rendered semantics. This sequence is repeated for a second novel view.
  • Figure 4: Comparison of predicted and ground truth (GT) color and semantic maps for novel views rendered by SegSplat on the 3D-OVS dataset [liu2023weakly]. Semantic maps are visualized using PCA. Ground truth semantic features are obtained by applying SAM+CLIP to the corresponding GT novel view images. Each group of four columns shows: GT RGB image, SegSplat-rendered RGB, GT semantics, and SegSplat-rendered semantics. This sequence is repeated for a second novel view.
  • Figure 5: A qualitative comparison of the masks produced by SegSplat and LangSplat on the 3D-OVS dataset [liu2023weakly]. We show results for two scenes and two different novel views. We observe that our method produces more accurate segmentation masks.
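
The decoding and querying steps described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `decode_features` and `query_mask`, the array shapes, the cosine-similarity threshold, and the toy memory bank are all assumptions made for illustration. In particular, it reads the "element-wise product between the rendered index maps and the memory bank" as a broadcast product followed by a sum over bank entries, and it stands in unit vectors for real CLIP embeddings.

```python
import numpy as np


def decode_features(index_maps: np.ndarray, memory_bank: np.ndarray) -> np.ndarray:
    """Decode rendered index maps into full-length per-pixel features.

    index_maps:  (H, W, K) rendered weights over K bank entries (assumed layout).
    memory_bank: (K, D) one feature vector per bank entry (assumed layout).
    Returns (H, W, D) L2-normalized features.
    """
    # Broadcast element-wise product, then sum over the K bank entries.
    weighted = index_maps[..., :, None] * memory_bank[None, None, :, :]
    feats = weighted.sum(axis=2)
    # Normalize so that a dot product with a unit vector is a cosine similarity.
    norms = np.linalg.norm(feats, axis=-1, keepdims=True)
    return feats / np.maximum(norms, 1e-8)


def query_mask(features: np.ndarray, text_embedding: np.ndarray,
               threshold: float = 0.5) -> np.ndarray:
    """Return a boolean (H, W) mask of pixels similar to the text embedding.

    The fixed threshold is a hypothetical choice, not taken from the paper.
    """
    text = text_embedding / np.linalg.norm(text_embedding)
    similarity = features @ text
    return similarity > threshold


# Toy memory bank: K=3 entries in a D=4 feature space (stand-ins for CLIP vectors).
memory_bank = np.eye(3, 4)

# Toy 2x2 rendered index map: three pure one-hot pixels and one blended pixel.
index_maps = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]],
])

features = decode_features(index_maps, memory_bank)

# Query with a "text" embedding that matches bank entry 1.
text_embedding = np.array([0.0, 1.0, 0.0, 0.0])
mask = query_mask(features, text_embedding)
print(mask)
```

Storing only K index channels per Gaussian instead of a D-dimensional feature is what keeps the representation compact when K is much smaller than D; the full-length features are recovered only after rendering.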