Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration

Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, Tae-Hyun Oh

TL;DR

This work introduces Dr. Splat, a novel approach for open-vocabulary 3D scene understanding leveraging 3D Gaussian Splatting that significantly outperforms existing approaches in 3D perception benchmarks, such as open-vocabulary 3D semantic segmentation, 3D object localization, and 3D object selection tasks.

Abstract

We introduce Dr. Splat, a novel approach for open-vocabulary 3D scene understanding leveraging 3D Gaussian Splatting. Unlike existing language-embedded 3DGS methods, which rely on a rendering process, our method directly associates language-aligned CLIP embeddings with 3D Gaussians for holistic 3D scene understanding. The key to our method is a language feature registration technique in which CLIP embeddings are assigned to the dominant Gaussians intersected by each pixel ray. Moreover, we integrate Product Quantization (PQ) trained on general large-scale image data to compactly represent embeddings without per-scene optimization. Experiments demonstrate that our approach significantly outperforms existing approaches in 3D perception benchmarks, such as open-vocabulary 3D semantic segmentation, 3D object localization, and 3D object selection tasks. For video results, please visit https://drsplat.github.io/
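
The Product Quantization step mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes a codebook of shape `(M, K, d_sub)` that has already been trained elsewhere. The function names `pq_encode` and `pq_decode`, the codebook sizes, and the random codebook in the usage lines are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np


def pq_encode(vectors, codebooks):
    """Map each vector to M centroid indices, one per sub-space.

    vectors:   (N, M * d_sub) float array of embeddings
    codebooks: (M, K, d_sub) float array, K centroids per sub-space
    returns:   (N, M) integer array of PQ indices
    """
    m, _, d_sub = codebooks.shape
    sub = vectors.reshape(len(vectors), m, d_sub)
    # Squared distance from every sub-vector to every centroid: (N, M, K)
    dists = ((sub[:, :, None, :] - codebooks[None]) ** 2).sum(axis=-1)
    return dists.argmin(axis=-1)


def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the selected centroids."""
    m = codebooks.shape[0]
    parts = codebooks[np.arange(m), codes]  # (N, M, d_sub)
    return parts.reshape(len(codes), -1)


# Usage with a random (hypothetical) codebook; a real one would be trained
# on a large set of image embeddings, as the abstract describes.
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(4, 8, 2))   # M=4 sub-spaces, K=8 centroids each
embeddings = rng.normal(size=(5, 8))     # five 8-dimensional embeddings
codes = pq_encode(embeddings, codebooks)        # compact (5, 4) index array
approx = pq_decode(codes, codebooks)            # lossy (5, 8) reconstruction
```

Because the codebook is shared across scenes, each Gaussian only needs to store its `M` small integer indices rather than a full embedding, which is the source of the compactness the abstract refers to.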

Paper Structure

This paper contains 37 sections, 12 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Comparison of 2D (left) vs. our direct 3D search (right) for open-vocabulary 3D scene understanding. The 2D approach relies on multiview rendering, incurring high computational costs. Our method directly links language features to 3D Gaussians, enabling efficient and complete spatial coverage. The table highlights Dr. Splat’s superior efficiency over related methods.
  • Figure 2: Visualization of the discrepancy between rendered 2D features and 3D features. Color indicates the cosine similarity between the feature of a text query and either (a) 3D features distilled through 2D rendering (LangSplat) or (b) directly registered 3D features.
  • Figure 3: Overview of Dr. Splat. (a) In the preprocessing stage, we optimize the 3D Gaussians (3DGS) and construct the Product Quantization (PQ) codebook. (b) During training, we extract CLIP embeddings from the given images $\{\mathbf{I}\}$ and then run the feature registration process (Sec. 4.1). Finally, we obtain 3D Gaussians $\Phi^\text{ours}$ with PQ indices $\{ j \}$ (Sec. 4.2).
  • Figure 4: Feature registration process in Dr. Splat. (a) We first map per-pixel CLIP embeddings $\{ \mathbf{f}^\text{map} \}$ to Gaussians. Here, we map only to the $k$ dominant Gaussians along pixel ray $r$, called the Top-$k$ Gaussians. (b) After collecting the embeddings, we compute aggregated features by weighted averaging. (c) Finally, we reuse PQ to obtain the PQ indices $j$ of the aggregated features and update the Gaussian parameters $\Phi^\text{ours}$.
  • Figure 5: Qualitative results of object selection on the LeRF-OVS dataset. We visualize renderings of the selected 3D Gaussians for LangSplat, OpenGaussian, and ours. For LangSplat, activations are often scattered and fail to localize the target. OpenGaussian often struggles to distinguish closely situated objects. In contrast, our model shows activations precisely limited to the queried object regions, effectively localizing only the relevant areas.
  • ...and 13 more figures
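
The registration steps in the Figure 4 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' code. It assumes each pixel ray comes with the indices of the Gaussians it intersects and a per-Gaussian contribution weight (for example an alpha-blending weight), and it uses a plain weighted mean as a stand-in for the paper's weighted-averaging equation, which is not reproduced in this section. The name `register_features` and all of its parameters are hypothetical.

```python
import numpy as np


def register_features(ray_gaussian_ids, ray_weights, pixel_embeddings,
                      num_gaussians, k=2):
    """Assign each pixel embedding to the Top-k Gaussians along its ray.

    ray_gaussian_ids: list of int arrays, Gaussians intersected by each ray
    ray_weights:      list of float arrays, assumed contribution of each
                      intersected Gaussian to the pixel
    pixel_embeddings: (num_rays, dim) array of per-pixel embeddings
    returns:          (num_gaussians, dim) unit-norm features and a boolean
                      mask of Gaussians that received at least one embedding
    """
    dim = pixel_embeddings.shape[1]
    feature_sum = np.zeros((num_gaussians, dim))
    weight_sum = np.zeros(num_gaussians)

    for ids, weights, emb in zip(ray_gaussian_ids, ray_weights, pixel_embeddings):
        # (a) keep only the k Gaussians with the largest weights on this ray
        top = np.argsort(weights)[::-1][:k]
        for g, w in zip(ids[top], weights[top]):
            feature_sum[g] += w * emb
            weight_sum[g] += w

    # (b) weighted mean per Gaussian, then L2 normalization
    registered = weight_sum > 0
    features = np.zeros_like(feature_sum)
    features[registered] = feature_sum[registered] / weight_sum[registered, None]
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    features = np.divide(features, norms, out=np.zeros_like(features),
                         where=norms > 0)
    return features, registered
```

Gaussians that never appear among the Top-$k$ of any ray keep a zero feature and are flagged as unregistered, so a caller can skip them when comparing against a text query. Step (c), quantizing the aggregated features into PQ indices, would follow on the returned `features` array.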