SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields
Qijing Li, Jingxiang Sun, Liang An, Zhaoqi Su, Hongwen Zhang, Yebin Liu
TL;DR
SemanticSplat is a fast, feed-forward framework that unifies 3D Gaussian splatting with latent semantic attributes for holistic 3D scene understanding from sparse views. Through multi-view feature fusion with cost volumes and a two-stage distillation of SAM and CLIP-LSeg signals, it reconstructs a semantic-aware 3D feature field that supports novel view synthesis, depth prediction, promptable segmentation, and open-vocabulary segmentation. Key contributions include semantic anisotropic Gaussians, multi-conditioned semantic feature aggregation, segmentation and language-feature distillation, and hierarchical pooling. The method delivers competitive results while retaining the real-time inference advantage of feed-forward prediction over per-scene optimization, enabling practical, scalable open-world 3D semantics with robust cross-view consistency and semantic alignment across modalities.
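The "semantic anisotropic Gaussians" named above can be pictured as ordinary 3D Gaussian splatting primitives augmented with a latent semantic feature vector. A minimal sketch follows; the field names, the feature dimension, and the quaternion parameterization are illustrative assumptions, not the paper's exact specification:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SemanticGaussian:
    """One splatting primitive: standard 3DGS parameters plus a latent
    semantic attribute (as would be distilled from SAM / CLIP-LSeg).
    Dimensions here are assumptions for illustration."""
    mean: np.ndarray       # (3,) center in world space
    scale: np.ndarray      # (3,) per-axis scale -> anisotropic covariance
    rotation: np.ndarray   # (4,) unit quaternion (w, x, y, z)
    opacity: float         # in (0, 1)
    color: np.ndarray      # (3,) RGB (spherical harmonics in practice)
    semantic: np.ndarray   # (D,) latent semantic feature, e.g. D = 64

    def covariance(self) -> np.ndarray:
        """Sigma = R diag(s)^2 R^T, the anisotropic 3x3 covariance."""
        w, x, y, z = self.rotation
        R = np.array([
            [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
            [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
            [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S @ R.T


g = SemanticGaussian(
    mean=np.zeros(3), scale=np.array([0.1, 0.2, 0.05]),
    rotation=np.array([1.0, 0.0, 0.0, 0.0]), opacity=0.9,
    color=np.array([0.5, 0.5, 0.5]), semantic=np.zeros(64),
)
Sigma = g.covariance()  # identity rotation -> diagonal covariance diag(s^2)
```

Rendering such a primitive alpha-composites the semantic vector alongside color, which is what lets a single splatted field serve both appearance and segmentation queries.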
Abstract
Holistic 3D scene understanding, which jointly models geometry, appearance, and semantics, is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes, failing to achieve holistic scene comprehension. Additionally, they suffer from low-quality geometry reconstruction and noisy artifacts. In contrast, per-scene optimization methods rely on dense input views, which reduces practicality and increases complexity during deployment. In this paper, we propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method, which unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling. To predict the semantic anisotropic Gaussians, SemanticSplat fuses diverse feature fields (e.g., LSeg, SAM) with a cost volume representation that stores cross-view feature similarities, enhancing coherent and accurate scene comprehension. Leveraging a two-stage distillation framework, SemanticSplat reconstructs a holistic multi-modal semantic feature field from sparse-view images. Experiments demonstrate the effectiveness of our method for 3D scene understanding tasks like promptable and open-vocabulary segmentation. Video results are available at https://semanticsplat.github.io.
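The cost volume mentioned in the abstract stores cross-view feature similarities over a set of depth hypotheses. A minimal plane-sweep-style sketch, assuming the source-view features have already been warped to the reference view and using dot-product correlation (the shapes, the similarity measure, and the function name are assumptions for illustration):

```python
import numpy as np


def build_cost_volume(ref_feat: np.ndarray, warped_src: np.ndarray) -> np.ndarray:
    """Correlate reference-view features with source-view features that were
    warped to the reference view at each depth candidate.

    ref_feat:   (C, H, W)    features of the reference view
    warped_src: (D, C, H, W) source features warped at D depth hypotheses
    returns:    (D, H, W)    per-pixel similarity for each depth candidate
                             (dot product over channels, normalized by C)
    """
    C = ref_feat.shape[0]
    return np.einsum("chw,dchw->dhw", ref_feat, warped_src) / C


rng = np.random.default_rng(0)
C, D, H, W = 16, 8, 4, 4
ref = rng.standard_normal((C, H, W))
src = rng.standard_normal((D, C, H, W))
src[3] = ref  # make depth hypothesis 3 the "correct" one for this toy example
cv = build_cost_volume(ref, src)
best = cv.argmax(axis=0)  # per-pixel index of the most similar depth hypothesis
```

A high similarity at some depth slice means the views agree on features there, which is the geometric signal the network consumes when predicting the semantic Gaussians.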
