Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding
Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, Shao-Hua Guan
TL;DR
This work addresses open-vocabulary querying in 3D scenes by augmenting 3D Gaussian Splatting with a compact, quantized language feature space and an uncertainty-guided semantic smoothing mechanism. It introduces a dense language feature extraction pipeline that fuses CLIP and DINO, followed by a max-cosine quantization to a discrete codebook, enabling memory-efficient embedding on dense 3D Gaussians. A compact semantic feature vector per Gaussian is learned and rendered to 2D, then decoded to discrete indices with a cross-entropy loss; semantic uncertainty and an adaptive low-frequency spatial smoothing loss suppress cross-view inconsistencies. Experiments on six Mip-NeRF360 scenes show state-of-the-art visual quality and open-vocabulary querying accuracy with real-time rendering on consumer hardware, highlighting practical impact for interactive 3D scene understanding and editing.
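The quantization and decoding steps summarized above can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the feature dimension, the codebook size, and the random toy data are all hypothetical, and how the codebook itself is built is not covered. It shows only the general idea of assigning each dense feature to the codebook entry with the highest cosine similarity, and of scoring decoded logits against those discrete indices with a cross-entropy loss.

```python
import numpy as np

def quantize_max_cosine(features, codebook):
    """Return, for each feature row, the index of the codebook row
    with the highest cosine similarity (hypothetical helper)."""
    # L2-normalize both sides so a dot product equals cosine similarity.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    similarity = f @ c.T              # shape (N, K)
    return similarity.argmax(axis=1)  # shape (N,)

def cross_entropy(logits, targets):
    """Mean cross-entropy between per-pixel logits (N, K) and
    discrete target indices (N,), via a stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy data (assumed sizes): 8 codebook entries of dimension 16, and
# 100 "pixel" features that are slightly noisy copies of those entries.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))
true_idx = rng.integers(0, 8, size=100)
features = codebook[true_idx] + 0.01 * rng.normal(size=(100, 16))

# Each pixel now stores one small integer instead of a 16-d vector.
indices = quantize_max_cosine(features, codebook)

# Stand-in for decoder outputs: confident logits for the right index.
confident_logits = 10.0 * np.eye(8)[indices]
loss_confident = cross_entropy(confident_logits, indices)
loss_uniform = cross_entropy(np.zeros((100, 8)), indices)
```

Because the assignment depends only on direction, rescaling a feature does not change its index. In this toy setup the confident logits give a loss close to zero, while uninformative logits give a loss of log(8), which is the behavior a decoder trained against the discrete indices would be pushed away from.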
Abstract
Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy heavily depends on neural networks that are resource-intensive in training and rendering. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis, directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we propose a dedicated quantization scheme that drastically reduces the memory requirement, and a novel embedding procedure that yields smoother yet highly accurate queries, countering the multi-view feature inconsistencies and the high-frequency inductive bias in point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy among current language-embedded representations, while maintaining real-time rendering frame rates on a single desktop GPU.
