econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians

Can Zhang, Gim Hee Lee

TL;DR

econSG tackles open-vocabulary 3D semantic segmentation with 3D Gaussian Splatting (3DGS) through two key components: Confidence-region Guided Regularization (CRR), which mutually refines 2D VLM features from OpenSeg and SAM to yield complete, boundary-accurate semantic masks across views; and a low-dimensional 3D contextual space, learned by an autoencoder that is pre-trained once, which fuses backprojected multi-view features for efficient initialization and supervision of the 3DGS semantic fields. The method initializes and optimizes the semantic fields in this latent space using CRR-aligned supervision and a compact cross-view loss, achieving state-of-the-art results on four benchmarks while significantly improving training efficiency. Because 2D features are backprojected into 3D and dimensionality reduction is applied only after fusion, econSG enforces strong multi-view consistency without sacrificing semantic richness. The approach enables robust open-world segmentation, language-guided editing, and 3D object localization, with practical speedups for real-time or interactive applications.

Abstract

Most recent work on open-vocabulary neural fields focuses on extracting precise semantic features from VLMs and then consolidating them efficiently into a multi-view consistent 3D neural field representation. However, most existing works over-trust SAM to regularize image-level CLIP without any further refinement. Moreover, several existing works improve efficiency by reducing the dimensionality of semantic features from 2D VLMs before fusing them with 3DGS semantic fields, which inevitably leads to multi-view inconsistency. In this work, we propose econSG for open-vocabulary semantic segmentation with 3DGS. Our econSG consists of: 1) A Confidence-region Guided Regularization (CRR) that mutually refines SAM and CLIP to get the best of both worlds for precise semantic features with complete and precise boundaries. 2) A low-dimensional contextual space that enforces 3D multi-view consistency while improving computational efficiency by fusing backprojected multi-view 2D features and then applying dimensionality reduction directly to the fused 3D features instead of operating on each 2D view separately. Our econSG shows state-of-the-art performance on four benchmark datasets compared to existing methods. Furthermore, it is also the most efficient to train among all the methods.
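
The "fuse first, reduce afterwards" ordering described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the per-point averaging, the function names (`fuse_views`, `fit_linear_codec`, `encode`, `decode`), and the use of a linear PCA projection in place of the learned autoencoder are all assumptions made for illustration.

```python
import numpy as np

def fuse_views(point_ids, feats, num_points):
    """Average all 2D features that back-project onto the same 3D point.

    point_ids: (N,) index of the 3D point hit by each pixel sample (any view)
    feats:     (N, D) high-dimensional 2D feature of each pixel sample
    Averaging is an assumed fusion rule, chosen only for simplicity.
    """
    dim = feats.shape[1]
    sums = np.zeros((num_points, dim))
    counts = np.zeros(num_points)
    np.add.at(sums, point_ids, feats)      # accumulate features per point
    np.add.at(counts, point_ids, 1.0)      # count observations per point
    counts = np.maximum(counts, 1.0)       # unseen points keep a zero feature
    return sums / counts[:, None]

def fit_linear_codec(fused, latent_dim):
    """PCA stand-in (assumption) for the autoencoder: returns (mean, basis)."""
    mean = fused.mean(axis=0)
    _, _, vt = np.linalg.svd(fused - mean, full_matrices=False)
    return mean, vt[:latent_dim]

def encode(x, mean, basis):
    """Map high-dimensional features to the low-dimensional space."""
    return (x - mean) @ basis.T

def decode(z, mean, basis):
    """Map low-dimensional codes back to the high-dimensional space."""
    return z @ basis + mean

# Synthetic scene: 50 points whose 32-D features lie in a 4-D subspace,
# observed from 3 views with small per-view noise.
rng = np.random.default_rng(0)
true_feats = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 32))
point_ids = np.tile(np.arange(50), 3)
view_feats = np.tile(true_feats, (3, 1)) + 0.01 * rng.normal(size=(150, 32))

fused = fuse_views(point_ids, view_feats, num_points=50)   # fuse in 3D first
mean, basis = fit_linear_codec(fused, latent_dim=4)        # then reduce
latent = encode(fused, mean, basis)
recon = decode(latent, mean, basis)
rel_err = np.linalg.norm(recon - fused) / np.linalg.norm(fused)
```

Because the codec is fitted once on the fused 3D features, every view shares the same low-dimensional code for a given point, which is the consistency argument the abstract makes against reducing each 2D view separately.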

Paper Structure

This paper contains 20 sections, 3 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Our econSG framework. 1) Top: Building the 3D contextual latent space. We use the image encoder from a VLM and our CRR to get 2D features $\hat{\mathcal{F}}^{2D}$, which are then back-projected and fused in 3D to get the high-dimensional 3D contextual code $\mathcal{M}$. An autoencoder $[g, h]$ is learned to map $\mathcal{M}$ into the low-dimensional space $\mathcal{M}_z$. 2) Bottom: 3DGS for semantic fields. We optimize the 3DGS semantic fields $f$ with $\mathcal{L}_{semantic}$ and $\mathcal{L}_{ce}$, supervised by the image $\hat{\mathcal{F}}^{2D}$ and query $T_z$ latent embeddings obtained by the encoder $g$, respectively. $\mathcal{M}_z$ is used to initialize $f$.
  • Figure 2: Qualitative comparison of our econSG with Gaussian Grouping ye2023gaussian on Replica.
  • Figure 3: Ablation on Confidence-region Guided Regularization (CRR) with qualitative results of our econSG on Replica. Panels (a)-(e) are from training views, and panels (f)-(g) are from testing views.
  • Figure 4: Qualitative 3D Segmentation results and comparison of our method. The second and fourth rows illustrate the feature visualization in 3D space.
  • Figure 5: Qualitative examples of language-guided segmentation and editing. Segmentation results of the rendered views are compared with Gaussian Grouping on the LERF-localization dataset.
  • ...and 8 more figures
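
The Figure 1 caption mentions query embeddings $T_z$ that live in the same latent space as the semantic field $f$. One plausible way to use such embeddings at query time is nearest-query matching by cosine similarity, sketched below. The listing above does not state the inference rule, so the matching rule and the name `label_by_query` are assumptions for illustration only.

```python
import numpy as np

def label_by_query(point_latents, query_latents, eps=1e-8):
    """Assign each 3D point the index of its most similar query embedding.

    point_latents: (P, d) latent semantic feature per point (hypothetical f)
    query_latents: (Q, d) latent embedding per text query (hypothetical T_z)
    Returns (labels, similarity), where similarity is the (P, Q) cosine matrix.
    """
    p = point_latents / (np.linalg.norm(point_latents, axis=1, keepdims=True) + eps)
    q = query_latents / (np.linalg.norm(query_latents, axis=1, keepdims=True) + eps)
    similarity = p @ q.T
    return similarity.argmax(axis=1), similarity

# Toy example with two queries in a 3-D latent space.
queries = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
points = np.array([[2.0, 0.1, 0.0],
                   [0.1, 3.0, 0.0],
                   [0.9, 0.2, 0.1]])
labels, similarity = label_by_query(points, queries)
```

Working in the low-dimensional space keeps the similarity computation cheap per point, which fits the efficiency motivation given in the abstract.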