SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields

Qijing Li, Jingxiang Sun, Liang An, Zhaoqi Su, Hongwen Zhang, Yebin Liu

TL;DR

SemanticSplat presents a fast, feed-forward framework that unifies 3D Gaussian splatting with latent semantic attributes to achieve holistic 3D scene understanding from sparse views. By multi-view feature fusion with cost volumes and a two-stage distillation of SAM and CLIP-LSeg signals, it reconstructs a semantic-aware 3D feature field capable of novel view synthesis, depth prediction, promptable segmentation, and open-vocabulary segmentation. Key contributions include semantic anisotropic Gaussians, multi-conditioned semantic feature aggregation, segmentation and language feature distillation, and hierarchical pooling, delivering competitive results while maintaining real-time inference advantages over per-scene optimization methods. The approach enables practical, scalable open-world 3D semantics with robust cross-view consistency and semantic alignment across modalities.

Abstract

Holistic 3D scene understanding, which jointly models geometry, appearance, and semantics, is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes, failing to achieve holistic scene comprehension. Additionally, they suffer from low-quality geometry reconstruction and noisy artifacts. In contrast, per-scene optimization methods rely on dense input views, which reduces practicality and increases complexity during deployment. In this paper, we propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method, which unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling. To predict the semantic anisotropic Gaussians, SemanticSplat fuses diverse feature fields (e.g., LSeg, SAM) with a cost volume representation that stores cross-view feature similarities, enhancing coherent and accurate scene comprehension. Leveraging a two-stage distillation framework, SemanticSplat reconstructs a holistic multi-modal semantic feature field from sparse-view images. Experiments demonstrate the effectiveness of our method for 3D scene understanding tasks like promptable and open-vocabulary segmentation. Video results are available at https://semanticsplat.github.io.
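The cost volume mentioned above stores cross-view feature similarities over a set of depth hypotheses. As a rough illustration of that idea (not the paper's implementation: the function name, the 1-D shift standing in for homography warping, and all shapes are assumptions for this toy sketch), one can score each depth hypothesis by the cosine similarity between a reference feature map and the correspondingly warped source feature map:

```python
import numpy as np

def plane_sweep_cost_volume(feat_ref, feat_src, shifts):
    """Toy plane-sweep cost volume.

    feat_ref, feat_src: (C, H, W) feature maps from two views.
    shifts: candidate horizontal disparities standing in for depth
    hypotheses (a real method warps via per-depth homographies).
    Returns a (D, H, W) volume of per-pixel cosine similarities.
    """
    ref = feat_ref / (np.linalg.norm(feat_ref, axis=0, keepdims=True) + 1e-8)
    vol = np.zeros((len(shifts),) + feat_ref.shape[1:])
    for d, s in enumerate(shifts):
        # "Warp" the source view to the reference view for hypothesis s.
        warped = np.roll(feat_src, s, axis=2)
        warped = warped / (np.linalg.norm(warped, axis=0, keepdims=True) + 1e-8)
        # Per-pixel cosine similarity between reference and warped features.
        vol[d] = (ref * warped).sum(axis=0)
    return vol

rng = np.random.default_rng(0)
feat_ref = rng.normal(size=(8, 4, 16))
feat_src = np.roll(feat_ref, -2, axis=2)  # source is the reference shifted by 2
vol = plane_sweep_cost_volume(feat_ref, feat_src, shifts=[0, 1, 2, 3])
best = vol.argmax(axis=0)  # per-pixel best depth hypothesis
```

In this synthetic setup the correct hypothesis (shift 2) undoes the displacement exactly, so the similarity peaks there at every pixel; taking the argmax over the depth axis recovers the disparity, which is the intuition behind matching-based depth prediction.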

Paper Structure

This paper contains 39 sections, 10 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Our approach utilizes sparse view images as input to reconstruct a holistic semantic Gaussian field, which includes both the Gaussian field with language features and the segmentation features. This reconstruction captures geometry, appearance, and multi-modal semantics, enabling us to perform multiple tasks such as novel view synthesis, depth prediction, open-vocabulary segmentation, and promptable segmentation.
  • Figure 2: We employ multi-view transformers with cross-attention to extract features from the input images and use cost volumes for feature matching (see Sec. 3.1). Aggregating the multi-conditioned semantic features from vision foundation models (VFMs) with the cost volumes (see Sec. 3.2), we predict semantic anisotropic Gaussians (see Sec. 3.3). Through a two-stage feature distillation process involving both segmentation features (see the SAM distillation section) and language features (see the LSeg distillation section), we reconstruct the holistic semantic feature field by jointly enforcing photometric fidelity and semantic consistency.
  • Figure 3: Novel View Synthesis Comparisons. Our method outperforms LSM and Feature-3DGS in challenging regions and matches the quality of the MVSplat baseline, showing that we reconstruct appearance successfully.
  • Figure 4: Language-based Segmentation Comparison. We visualize segmentation over a set of categories for an unseen view; our method outperforms the other 3D methods and performs comparably to the 2D VFMs, indicating that we effectively lift 2D language-image foundation models to 3D.
  • Figure 5: Visualization of the Semantic Feature Field. We visualize the language and segmentation features of novel views, demonstrating how we lift 2D features into 3D while maintaining consistency across views. The visualizations are generated using PCA (Pedregosa et al., 2011).
  • ...and 2 more figures
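Since the method attaches latent semantic attributes to each Gaussian, a rendered semantic feature map can be obtained the same way 3D Gaussian splatting blends color: front-to-back alpha compositing, applied to feature vectors instead of RGB. A minimal per-ray numpy sketch of that blend (the function and the 1-D setup are illustrative assumptions, not the paper's code):

```python
import numpy as np

def composite_features(alphas, feats):
    """Front-to-back alpha compositing of per-Gaussian feature vectors
    along one ray: the feature analogue of the 3DGS color blend
        F = sum_i f_i * a_i * prod_{j<i} (1 - a_j).

    alphas: (N,) opacities of the N Gaussians hit by the ray, sorted
            front to back.
    feats:  (N, C) latent semantic feature vector of each Gaussian.
    Returns the (C,) composited feature for this ray's pixel.
    """
    transmittance = 1.0            # fraction of light still unblocked
    out = np.zeros(feats.shape[1])
    for a, f in zip(alphas, feats):
        out += transmittance * a * f
        transmittance *= 1.0 - a   # later Gaussians see less weight
    return out

# A fully opaque front Gaussian hides everything behind it:
front_only = composite_features(np.array([1.0]), np.array([[3.0, 4.0]]))
# A half-transparent front Gaussian mixes with the one behind:
mixed = composite_features(np.array([0.5, 1.0]),
                           np.array([[1.0], [0.0]]))
```

Rendering features with the same compositing weights as color is what keeps the distilled SAM and LSeg features geometrically consistent with the reconstructed appearance across views.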