Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, Shao-Hua Guan

TL;DR

This work addresses open-vocabulary querying in 3D scenes by augmenting 3D Gaussian Splatting with a compact, quantized language feature space and an uncertainty-guided semantic smoothing mechanism. It introduces a dense language feature extraction pipeline that fuses CLIP and DINO, followed by a max-cosine quantization to a discrete codebook, enabling memory-efficient embedding on dense 3D Gaussians. A compact semantic feature vector per Gaussian is learned and rendered to 2D, then decoded to discrete indices with a cross-entropy loss; semantic uncertainty and an adaptive low-frequency spatial smoothing loss suppress cross-view inconsistencies. Experiments on six Mip-NeRF360 scenes show state-of-the-art visual quality and open-vocabulary querying accuracy with real-time rendering on consumer hardware, highlighting practical impact for interactive 3D scene understanding and editing.
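The quantization step mentioned above can be illustrated with a small sketch. It assumes a codebook is already available and assigns each dense feature to the codebook entry with the highest cosine similarity. The function names, the toy three-dimensional vectors, and the fixed codebook are hypothetical and chosen only for illustration; how the paper builds or updates its codebook is not shown here.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def quantize_max_cosine(features, codebook):
    """Return, for each feature, the index of the most cosine-similar codebook entry."""
    unit_codes = [normalize(c) for c in codebook]
    indices = []
    for f in features:
        u = normalize(f)
        # Dot product of unit vectors equals cosine similarity.
        sims = [sum(a * b for a, b in zip(u, c)) for c in unit_codes]
        indices.append(max(range(len(sims)), key=sims.__getitem__))
    return indices

# Toy data (hypothetical): three codebook entries and three per-pixel features.
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
features = [[0.9, 0.1, 0.0], [0.1, 0.2, 3.0], [-0.2, 2.0, 0.1]]
indices = quantize_max_cosine(features, codebook)
print(indices)
```

Storing one integer index per pixel in place of a high-dimensional feature vector is the source of the memory saving in this sketch; the discrete indices can then serve as classification targets for a cross-entropy loss.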

Abstract

Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy depends heavily on neural networks that are resource-intensive in both training and rendering. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis, directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we propose a dedicated quantization scheme that drastically reduces the memory requirement, and a novel embedding procedure that yields smoother yet highly accurate queries, countering the multi-view feature inconsistencies and the high-frequency inductive bias in point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy among current language-embedded representations, while maintaining real-time rendering frame rates on a single desktop GPU.

Paper Structure (33 sections, 22 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: We present Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary querying. The top row visualizes the original image, novel view synthesis result with query relevancy and PCA of learned semantic features. The bottom row compares our method with other language-embedded representations. The right-side bar maps relevancy values to heatmap colors. Our method achieves better fidelity and query accuracy while rendering at higher frame rates.
  • Figure 2: The training process for Language Embedded 3D Gaussians starts by initializing the scene following 3D Gaussian Splatting (Kerbl et al., 2023), randomly initializing the semantic features, and setting the uncertainty to zero. Dense language features from multi-view CLIP (Radford et al., 2021) and DINO (Caron et al., 2021) are quantized to create a discrete feature space and semantic indices. These attributes of the 3D Gaussians are then rendered into 2D maps using a differentiable rasterizer. Optimization is driven by a semantic loss and an adaptive spatial smoothing loss.
  • Figure 3: Comparison of novel view synthesis and query relevance visualization. Left to right: ground-truth novel view images, then novel view images with relevance visualization from our method, DFF (Kobayashi et al., 2022), LeRF (Kerr et al., 2023), and 3DOVS (Liu et al., 2023). Top to bottom: query words "asphalt ground", "bicycle", "jar of coconut oil", "flower", "LEGO Technic 856 Bulldozer", and "brown shoes".
  • Figure 4: Visual quality comparison of novel view synthesis results. Our method is able to recover more detailed geometry and appearance compared to other methods.
  • Figure 5: Images of various open-vocabulary queries.
  • ...and 6 more figures