CLIP-GS: CLIP-Informed Gaussian Splatting for View-Consistent 3D Indoor Semantic Understanding

Guibiao Liao, Jiankun Li, Zhenyu Bao, Xiaoqing Ye, Qing Li, Kanglin Liu

TL;DR

CLIP-GS targets open-vocabulary 3D semantic understanding of indoor scenes by extending 3D Gaussian Splatting with a compact semantic representation and cross-view regularization. The framework introduces Semantic Attribute Compactness (SAC), which attaches low-dimensional semantic embeddings to Gaussians, and 3D Coherent Regularization (3DCR), which enforces semantic consistency both across 2D views and among the 3D Gaussians of each object. With a two-phase training regime and Progressive Densification Regulation, CLIP-GS renders in real time while substantially improving segmentation accuracy on ScanNet, Replica, and 3DOVS, including in sparse-view scenarios. The result is coherent 3D semantics with strong open-world generalization, supporting efficient, language-driven indoor scene understanding for robotics and AR/VR applications.

Abstract

Exploiting 3D Gaussian Splatting (3DGS) with Contrastive Language-Image Pre-Training (CLIP) models for open-vocabulary 3D semantic understanding of indoor scenes has emerged as an attractive research focus. Existing methods typically attach high-dimensional CLIP semantic embeddings to 3D Gaussians and leverage view-inconsistent 2D CLIP semantics as Gaussian supervision, resulting in efficiency bottlenecks and deficient 3D semantic consistency. To address these challenges, we present CLIP-GS, which efficiently achieves a coherent semantic understanding of 3D indoor scenes via the proposed Semantic Attribute Compactness (SAC) and 3D Coherent Regularization (3DCR). The SAC approach exploits the naturally unified semantics within objects to learn compact, yet effective, semantic Gaussian representations, enabling highly efficient rendering (>100 FPS). 3DCR enforces semantic consistency in the 2D and 3D domains: in 2D, 3DCR utilizes refined view-consistent semantic outcomes derived from 3DGS to establish cross-view coherence constraints; in 3D, 3DCR encourages similar features among 3D Gaussian primitives associated with the same object, leading to more precise and coherent segmentation results. Extensive experimental results demonstrate that our method remarkably surpasses existing state-of-the-art approaches, achieving mIoU improvements of 21.20% and 13.05% on the ScanNet and Replica datasets, respectively, while maintaining real-time rendering speed. Furthermore, our approach exhibits superior performance even with sparse input data, substantiating its robustness.
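The 3D half of 3DCR, as the abstract describes it, pulls together the semantic features of Gaussians that belong to the same object. The sketch below is one simple way to express that idea and is not the paper's loss: it assumes each Gaussian already carries an object id (the abstract does not say how the loss consumes that assignment), and the function name `object_consistency_loss` and the choice of cosine distance to the per-object mean feature are illustrative assumptions.

```python
import numpy as np


def object_consistency_loss(features, object_ids):
    """Average (1 - cosine similarity) between each Gaussian's semantic
    feature and the mean feature of the object it is assigned to.

    Hypothetical helper: the object assignment and the cosine-to-mean
    form are assumptions made for illustration only.
    """
    features = np.asarray(features, dtype=np.float64)
    object_ids = np.asarray(object_ids)
    # Unit-normalize so the dot product below is a cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)

    per_gaussian = []
    for obj in np.unique(object_ids):
        group = unit[object_ids == obj]
        center = group.mean(axis=0)
        center = center / max(np.linalg.norm(center), 1e-8)
        per_gaussian.append(1.0 - group @ center)
    return float(np.concatenate(per_gaussian).mean())


# Toy features for four Gaussians (2-D only to keep the example readable).
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
coherent = object_consistency_loss(feats, [0, 0, 1, 1])    # each object agrees internally
incoherent = object_consistency_loss(feats, [0, 1, 0, 1])  # each object mixes both features
print(coherent, incoherent)
```

The loss is zero when every object's Gaussians share one feature direction and grows as features within an object diverge, which is the behavior the 3D consistency term is meant to penalize.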

Paper Structure (21 sections, 11 equations, 14 figures, 10 tables)

Figures (14)

  • Figure 1: Visual comparisons between different CLIP-informed 3D semantic segmentation methods using the text query "Picture" across different views. The NeRF-based method 3DOVS and the 3DGS-based methods Feature 3DGS and LangSplat exhibit ambiguous semantics and efficiency bottlenecks. In contrast, our approach achieves more precise and consistent semantic segmentation results at a faster speed.
  • Figure 2: Illustration of CLIP-GS optimization. Left: CLIP-GS represents the 3D scene with a collection of 3D Gaussians with learnable attributes, specifically adding a semantic attribute. Right: First, multi-view images undergo feature extraction using the frozen CLIP model and region mask generation with SAM. We then optimize CLIP-GS in an end-to-end manner through two phases. In Phase I, we introduce Semantic Attribute Compactness (SAC) to capture the unified semantics within each object, facilitating efficient optimization and rendering of semantic Gaussians. In Phase II, after training 3D Gaussians for a certain number of iterations, we present 3D Coherent Regularization (3DCR) to enhance 3D semantic consistency. 3DCR leverages self-predicted semantics derived from CLIP-GS, refined by cross-view coherent regularization, to provide view-consistent supervision signals for optimizing Gaussians. Additionally, 3DCR identifies 3D Gaussian primitives associated with the same object through ray-based intersection matching and encourages their semantics to be similar. The color optimization process follows 3DGS and is omitted for brevity.
  • Figure 3: Illustration of rendered segmentation maps with the text query "Rug". (a) and (b) correspond to training views, while (c) and (d) pertain to testing views. In (a), the rendered result appears ambiguous when the 2D CLIP semantics are used unchanged for Gaussian optimization. Conversely, (b) shows that leveraging 3DCR provides coherent semantic constraints to supervise the Gaussians, leading to more precise results in (d). For more details, refer to the 3DCR subsection.
  • Figure 4: Illustration of the consistency regularization for 3D Gaussians in 3D Coherent Regularization (3DCR).
  • Figure 5: Visual segmentation results on novel views under multi-view training conditions. While current methods produce ambiguous segmentation results due to the absence of 3D consistency constraints, our method achieves more precise and view-consistent results across various views.
  • ...and 9 more figures
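
The text-query results in the figures above ("Picture", "Rug") rest on comparing rendered per-pixel semantic features with text embeddings. A minimal sketch of that comparison follows, under stated assumptions: the feature map and the text embeddings are toy arrays already living in the same space (in a real pipeline they would come from the renderer and from CLIP's text encoder), and the function name `segment_by_text` and the argmax-over-candidate-labels rule are illustrative choices, not the paper's procedure.

```python
import numpy as np


def segment_by_text(feature_map, text_embeddings, query_index):
    """Return a boolean mask of pixels whose best-matching label is the query.

    feature_map:     (H, W, D) rendered semantic features (toy data here).
    text_embeddings: (K, D) one embedding per candidate label (toy data here).
    query_index:     row of text_embeddings that corresponds to the query.
    """
    h, w, d = feature_map.shape
    pixels = feature_map.reshape(-1, d)
    # Normalize both sides so the matrix product is a cosine similarity.
    pixels = pixels / np.maximum(np.linalg.norm(pixels, axis=1, keepdims=True), 1e-8)
    texts = text_embeddings / np.maximum(
        np.linalg.norm(text_embeddings, axis=1, keepdims=True), 1e-8
    )
    similarity = pixels @ texts.T                      # (H*W, K)
    labels = similarity.argmax(axis=1).reshape(h, w)   # best label per pixel
    return labels == query_index


# Two candidate labels with hand-made embeddings: index 0 = "picture", 1 = "wall".
text_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
feat_map = np.array(
    [
        [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0]],
        [[0.1, 0.9, 0.0], [0.7, 0.3, 0.0]],
    ]
)
mask = segment_by_text(feat_map, text_emb, query_index=0)
print(mask)
```

Because the mask is computed per rendered view, inconsistent features across views translate directly into the ambiguous masks that the captions attribute to methods without 3D consistency constraints.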