CLIP-GS: CLIP-Informed Gaussian Splatting for View-Consistent 3D Indoor Semantic Understanding
Guibiao Liao, Jiankun Li, Zhenyu Bao, Xiaoqing Ye, Qing Li, Kanglin Liu
TL;DR
CLIP-GS targets open-vocabulary 3D semantic understanding of indoor scenes by extending 3D Gaussian Splatting with a compact semantic representation and cross-view regularization. The framework introduces Semantic Attribute Compactness (SAC) to attach low-dimensional semantic embeddings to Gaussians, and 3D Coherent Regularization (3DCR) to enforce semantic consistency both across 2D views and at the 3D object level. Through a two-phase training regime and Progressive Densification Regulation, CLIP-GS achieves real-time rendering while substantially improving segmentation accuracy on ScanNet, Replica, and 3DOVS, including in sparse-view scenarios. The method delivers robust, coherent 3D semantics with strong open-world generalization, enabling efficient, language-driven indoor scene understanding for real-world robotics and AR/VR applications.
Abstract
Exploiting 3D Gaussian Splatting (3DGS) with Contrastive Language-Image Pre-Training (CLIP) models for open-vocabulary 3D semantic understanding of indoor scenes has emerged as an attractive research focus. Existing methods typically attach high-dimensional CLIP semantic embeddings to 3D Gaussians and leverage view-inconsistent 2D CLIP semantics as Gaussian supervision, resulting in efficiency bottlenecks and deficient 3D semantic consistency. To address these challenges, we present CLIP-GS, which efficiently achieves a coherent semantic understanding of 3D indoor scenes via the proposed Semantic Attribute Compactness (SAC) and 3D Coherent Regularization (3DCR). The SAC approach exploits the naturally unified semantics within objects to learn compact, yet effective, semantic Gaussian representations, enabling highly efficient rendering (>100 FPS). 3DCR enforces semantic consistency in both the 2D and 3D domains: in 2D, 3DCR utilizes refined view-consistent semantic outcomes derived from 3DGS to establish cross-view coherence constraints; in 3D, 3DCR encourages similar features among 3D Gaussian primitives associated with the same object, leading to more precise and coherent segmentation results. Extensive experimental results demonstrate that our method remarkably surpasses existing state-of-the-art approaches, achieving mIoU improvements of 21.20% and 13.05% on the ScanNet and Replica datasets, respectively, while maintaining real-time rendering speed. Furthermore, our approach exhibits superior performance even with sparse input data, substantiating its robustness.
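
To make the 3D side of 3DCR more concrete, the following is a minimal sketch of one way an object-level coherence term could be written. It is not the loss used in the paper: the function name object_coherence_loss, the cosine-distance formulation, and the assumption that each Gaussian already carries an object id are all illustrative choices. The sketch pulls each Gaussian's low-dimensional semantic feature toward the mean feature direction of the object it belongs to.

```python
import numpy as np


def object_coherence_loss(features, object_ids, eps=1e-8):
    """Hypothetical object-level coherence term (illustrative, not the paper's loss).

    features:   (N, d) array of per-Gaussian semantic features.
    object_ids: (N,) array assigning each Gaussian to an object (assumed given).
    Returns the mean of (1 - cosine similarity) between every feature and the
    normalized mean feature of its object; 0 means perfectly coherent objects.
    """
    features = np.asarray(features, dtype=np.float64)
    object_ids = np.asarray(object_ids).ravel()

    # L2-normalize each feature so that dot products are cosine similarities.
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)

    # Map arbitrary object ids to contiguous indices 0..K-1.
    _, inverse = np.unique(object_ids, return_inverse=True)
    num_objects = int(inverse.max()) + 1

    # Sum the unit features per object, then normalize to get a centroid direction.
    sums = np.zeros((num_objects, unit.shape[1]))
    np.add.at(sums, inverse, unit)
    centroids = sums / (np.linalg.norm(sums, axis=1, keepdims=True) + eps)

    # Cosine similarity of each Gaussian's feature to its own object's centroid.
    cos = np.sum(unit * centroids[inverse], axis=1)
    return float(np.mean(1.0 - cos))
```

Under these assumptions, the term is near zero when all Gaussians of an object share the same feature direction and grows as features within an object diverge. In a training setting the same computation would be expressed in an autodiff framework so that gradients reach the per-Gaussian features.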
