Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

Jun Guo, Xiaojian Ma, Yue Fan, Huaping Liu, Qing Li

TL;DR

Semantic Gaussians address open-vocabulary 3D scene understanding by distilling 2D semantic knowledge into a 3D Gaussian Splatting framework and augmenting it with a 3D semantic network for fast inference. The approach pairs a versatile, training-free 2D-to-3D projection pipeline, which produces a projected semantic component $s^{\text{2D}}$ for each Gaussian, with a 3D network that predicts a second component $s^{\text{3D}}$ directly from raw Gaussians. Either component, or their fusion, can be compared against a text embedding to query the scene by language. Results on ScanNet segmentation and LERF object localization show strong open-vocabulary performance, and qualitative results cover part segmentation, instance segmentation, scene editing, and spatiotemporal tracking. By uniting explicit 3D geometry with 2D semantic priors, Semantic Gaussians target open-ended scene understanding for robotics and augmented reality applications.

Abstract

Open-vocabulary 3D scene understanding presents a significant challenge in computer vision, with wide-ranging applications in embodied agents and augmented reality systems. Existing methods adopt neural rendering methods as 3D representations and jointly optimize color and semantic features to achieve rendering and scene understanding simultaneously. In this paper, we introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting. Our key idea is to distill knowledge from 2D pre-trained models to 3D Gaussians. Unlike existing methods, we design a versatile projection approach that maps various 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians. The projection is based on spatial relationships and needs no additional training. We further build a 3D semantic network that directly predicts the semantic component from raw 3D Gaussians for fast inference. The quantitative results on ScanNet segmentation and LERF object localization demonstrate the superior performance of our method. Additionally, we explore several applications of Semantic Gaussians, including object part segmentation, instance segmentation, scene editing, and spatiotemporal segmentation, with better qualitative results than 2D and 3D baselines, highlighting its versatility and effectiveness in supporting diverse downstream tasks.
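
To make the training-free projection idea concrete, here is a minimal sketch of how per-pixel 2D features could be gathered onto Gaussian centers and averaged across views. It is an illustration under stated assumptions, not the authors' implementation: the pinhole camera with no rotation, the nearest-pixel lookup, the absence of any occlusion test, the simple averaging, and the names `project_point` and `project_features` are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def project_point(point: Sequence[float], focal: float, cx: float, cy: float,
                  cam_pos: Sequence[float]) -> Optional[Tuple[int, int]]:
    """Project a 3D point to integer pixel coordinates (u, v).

    Assumes a pinhole camera at `cam_pos` looking down +z with no rotation.
    Returns None when the point lies behind the camera.
    """
    x, y, z = (p - c for p, c in zip(point, cam_pos))
    if z <= 0:
        return None
    u = int(round(focal * x / z + cx))
    v = int(round(focal * y / z + cy))
    return u, v


def project_features(centers: List[Sequence[float]],
                     views: List[dict]) -> List[Optional[List[float]]]:
    """Average the 2D features observed at each Gaussian center over all views.

    Each view is a dict with keys "feat" (H x W x C nested lists), "focal",
    "cx", "cy" and "cam_pos". No occlusion check is done (an assumption).
    Returns one feature per Gaussian, or None if no view observes it.
    """
    result: List[Optional[List[float]]] = []
    for center in centers:
        acc: Optional[List[float]] = None
        count = 0
        for view in views:
            pix = project_point(center, view["focal"], view["cx"],
                                view["cy"], view["cam_pos"])
            if pix is None:
                continue
            u, v = pix
            feat_map = view["feat"]
            height, width = len(feat_map), len(feat_map[0])
            if not (0 <= v < height and 0 <= u < width):
                continue  # projects outside the image
            feat = feat_map[v][u]
            acc = list(feat) if acc is None else [a + b for a, b in zip(acc, feat)]
            count += 1
        result.append(None if acc is None else [a / count for a in acc])
    return result
```

Because the sketch only gathers and averages features, no optimization loop is involved, which matches the "no additional training" property described in the abstract.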

Paper Structure

This paper contains 28 sections, 6 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of our Semantic Gaussians. We inject semantic features into off-the-shelf 3D Gaussian Splatting by either projecting semantic features from pre-trained 2D encoders or directly predicting pointwise embeddings by a 3D semantic network (or fusing these two). The newly added semantic components of 3D Gaussians open up diverse applications centered around open-vocabulary scene understanding.
  • Figure 2: An illustration of the pipeline of Semantic Gaussians. Upper left: our projection framework maps various pre-trained 2D features to the semantic component $s^{\text{2D}}$ of 3D Gaussians; Bottom left: we additionally introduce a 3D semantic network that directly predicts the semantic components $s^{\text{3D}}$ out of raw 3D Gaussians. It is supervised by the projected $s^{\text{2D}}$; Right: given an open-vocabulary text query, we compare its embedding against the semantic components ($s^{\text{2D}}$, $s^{\text{3D}}$, or their fusion) of 3D Gaussians. The matched Gaussians will be splatted to render the 2D mask corresponding to the query.
  • Figure 3: Visualization of scene-level semantic segmentation performance for open-vocabulary 3D scene understanding methods on the ScanNet dataset.
  • Figure 4: Qualitative comparisons of different methods on the MVImgNet part segmentation task. We choose 6 classes of objects with 3, 4 and 5 parts to show the part segmentation performance.
  • Figure 5: Qualitative results of spatiotemporal tracking on the CMU Panoptic dataset. We choose 4 scenes with humans and dynamic objects to show the tracking performance.
  • ...and 3 more figures
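
The query step described in the Figure 2 caption (comparing a text embedding against $s^{\text{2D}}$, $s^{\text{3D}}$, or their fusion) can be sketched as follows. This is a hedged illustration only: the fusion rule (averaging unit-normalized features), the argmax-over-labels matching, and the names `fuse` and `match_gaussians` are assumptions, and the toy two-dimensional embeddings stand in for real text-encoder outputs. Splatting the matched Gaussians into a 2D mask is not covered.

```python
import math
from typing import Dict, List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def fuse(s2d: Sequence[float], s3d: Sequence[float]) -> List[float]:
    """Assumed fusion rule: average of the unit-normalized components."""
    return [(a + b) / 2 for a, b in zip(normalize(s2d), normalize(s3d))]


def match_gaussians(features: List[Sequence[float]],
                    text_embeddings: Dict[str, Sequence[float]],
                    query: str) -> List[int]:
    """Return indices of Gaussians whose most similar label is `query`."""
    matched = []
    for index, feat in enumerate(features):
        best = max(text_embeddings,
                   key=lambda name: cosine(feat, text_embeddings[name]))
        if best == query:
            matched.append(index)
    return matched


# Toy example: three Gaussians, two candidate labels.
labels = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
s2d = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
s3d = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]
fused = [fuse(a, b) for a, b in zip(s2d, s3d)]

chairs_2d_only = match_gaussians(s2d, labels, "chair")
chairs_fused = match_gaussians(fused, labels, "chair")
```

In this toy setup the third Gaussian is labeled "chair" when only the projected features are used, but "table" after fusion, which shows how the two components can disagree and why the choice of fusion rule matters.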