Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding

Wenbo Zhang, Lu Zhang, Ping Hu, Liqian Ma, Yunzhi Zhuge, Huchuan Lu

TL;DR

This work introduces the IDentity-coupled Semantic Field (IDSF) into 3DGS, which captures both semantic representations and view-consistent instance indices for each Gaussian, and introduces a 2D-3D joint contrastive loss to enhance the complementarity between view-consistent 3D geometry and rich semantics during the bootstrapping process.

Abstract

Injecting semantics into 3D Gaussian Splatting (3DGS) has recently garnered significant attention. While current approaches typically distill 3D semantic features from 2D foundational models (e.g., CLIP and SAM) to facilitate novel view segmentation and semantic understanding, their heavy reliance on 2D supervision can undermine cross-view semantic consistency and necessitate complex data preparation processes, therefore hindering view-consistent scene understanding. In this work, we present FreeGS, an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels. Instead of directly learning semantic features, we introduce the IDentity-coupled Semantic Field (IDSF) into 3DGS, which captures both semantic representations and view-consistent instance indices for each Gaussian. We optimize IDSF with a two-step alternating strategy: semantics help to extract coherent instances in 3D space, while the resulting instances regularize the injection of stable semantics from 2D space. Additionally, we adopt a 2D-3D joint contrastive loss to enhance the complementarity between view-consistent 3D geometry and rich semantics during the bootstrapping process, enabling FreeGS to uniformly perform tasks such as novel-view semantic segmentation, object selection, and 3D object detection. Extensive experiments on LERF-Mask, 3D-OVS, and ScanNet datasets demonstrate that FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload. Our code is publicly available at https://github.com/wb014/FreeGS.
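To make the idea of a 2D-3D joint contrastive loss more concrete, here is a minimal NumPy sketch of one common formulation, an InfoNCE-style loss over instance prototypes. It is not taken from the paper or the FreeGS repository. The function name `joint_contrastive_loss`, the `temperature` parameter, the use of mean 3D features as prototypes, and the assumption that each rendered 2D feature row already carries the instance index of its Gaussian are all illustrative assumptions.

```python
import numpy as np

def joint_contrastive_loss(feat_3d, feat_2d, instance_ids, temperature=0.1):
    """Hypothetical InfoNCE-style loss (an assumption, not the paper's exact loss).

    feat_3d:      (N, D) per-Gaussian semantic features.
    feat_2d:      (N, D) rendered 2D features, row i paired with row i of feat_3d.
    instance_ids: (N,)   instance index assigned to each row by clustering.
    """
    # L2-normalise both feature sets so dot products are cosine similarities.
    f3 = feat_3d / np.linalg.norm(feat_3d, axis=1, keepdims=True)
    f2 = feat_2d / np.linalg.norm(feat_2d, axis=1, keepdims=True)

    # Map arbitrary instance ids to contiguous indices 0..K-1.
    ids, inverse = np.unique(instance_ids, return_inverse=True)
    inverse = inverse.reshape(-1)

    # One prototype per instance: the normalised mean of its 3D features.
    protos = np.stack([f3[inverse == k].mean(axis=0) for k in range(len(ids))])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)

    # Similarity of every 2D feature to every instance prototype.
    logits = f2 @ protos.T / temperature            # shape (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Cross-entropy with the row's own instance as the positive class.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(inverse)), inverse].mean()
```

Under these assumptions the loss is small when each rendered 2D feature lies close to the prototype of its own instance and far from the others, which is one way to read "compactness and discrimination" in a joint feature space.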

Paper Structure

This paper contains 15 sections, 9 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: We present a novel framework, FreeGS, to inject 2D semantics into 3DGS without the need for any 2D labels. After a 2D-3D collaborative bootstrapping learning strategy, the model can support versatile applications, such as novel-view 2D segmentation, open-vocabulary 3D detection, and interactive object selection.
  • Figure 2: Framework overview of FreeGS. The framework consists of three key components: Union-space 3D Gaussian Clustering, Multi-level 2D Semantic Distillation, and 2D-3D Joint Contrastive Learning. In the 3D space, Gaussians equipped with the IDentity-coupled Semantic Fields (IDSF) are input to the union-space clustering module to extract view-consistent instance indices. Subsequently, the IDSF is rendered onto 2D space and supervised by multi-level features from foundational models. Additionally, a 2D-3D joint contrastive loss is applied between instance-aware 3D features and rendered 2D features to enhance the compactness and discrimination of semantics in the joint feature space. The alternating updates of the semantic field and the instance clustering bootstrap view-consistent semantics in Gaussians, without relying on any 2D labels.
  • Figure 3: Qualitative comparisons of different methods on the LERF-Mask and 3D-OVS datasets. Our method successfully segments instances with consistency across different views.
  • Figure 4: Qualitative comparisons of different methods on "Scene0462" of the ScanNet dataset. We use yellow lines to draw the ground truth 3D bounding boxes. For the two queries in the figure, LangSplat generates bounding boxes as large as the whole scene because of noise points, so we do not show box visualization for LangSplat. In contrast, our method demonstrates much more accurate localization and segmentation.
  • Figure 5: Visualization of ablation results on the "bench" scene of the 3D-OVS dataset.
  • ...and 1 more figure