LangSplat: 3D Language Gaussian Splatting

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, Hanspeter Pfister

TL;DR

This work presents LangSplat, a fast and accurate framework for open-vocabulary querying in 3D scenes. It replaces NeRF-based grounding with 3D Gaussian Splatting augmented by language embeddings, uses a scene-specific autoencoder to curb memory costs, and leverages SAM-derived hierarchical semantics to address point ambiguity. The approach achieves state-of-the-art results on open-vocabulary 3D object localization and semantic segmentation, while delivering substantial speedups over prior methods such as LERF. These contributions enable practical, high-resolution 3D language querying with improved object boundaries and efficiency.

Abstract

Humans live in a 3D world and commonly use natural language to interact with a 3D scene. Modeling a 3D language field to support open-ended language queries in 3D has gained increasing attention recently. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat represents the language field with a collection of 3D Gaussians, each encoding language features distilled from CLIP. By employing a tile-based splatting technique for rendering language features, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features on the scene-specific latent space, thereby alleviating the substantial memory demands imposed by explicit modeling. Existing methods struggle with imprecise and vague 3D language fields, which fail to discern clear boundaries between objects. We delve into this issue and propose to learn hierarchical semantics using SAM, thereby eliminating the need for extensively querying the language field across various scales and for the regularization of DINO features. Extensive experimental results show that LangSplat outperforms the previous state-of-the-art method LERF by a large margin. Notably, LangSplat is extremely efficient, achieving a 199$\times$ speedup compared to LERF at a resolution of 1440 $\times$ 1080. We strongly recommend that readers check out our video results at https://langsplat.github.io/
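The memory argument behind the scene-wise autoencoder can be illustrated with a minimal sketch. It uses a PCA-style linear projection as a stand-in for the learned autoencoder (this is not the authors' network), and the 512-dimensional feature width, the 3-dimensional latent, the function names, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

FEATURE_DIM = 512  # assumed CLIP embedding width (illustrative, not from the text)
LATENT_DIM = 3     # assumed scene-specific latent width (illustrative)


def fit_linear_autoencoder(features, latent_dim):
    """Fit a PCA-style linear encoder/decoder on one scene's features.

    Stand-in for a learned scene-wise autoencoder: returns the feature mean
    and an orthonormal basis of shape (latent_dim, feature_dim).
    """
    mean = features.mean(axis=0)
    # Rows of vt are the principal directions of the centered features.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:latent_dim]


def encode(features, mean, basis):
    """Project high-dimensional features onto the scene-specific latent space."""
    return (features - mean) @ basis.T


def decode(codes, mean, basis):
    """Map latent codes back to the original feature space."""
    return codes @ basis + mean


# Synthetic "scene": features confined to a low-dimensional affine subspace,
# mimicking the idea that a single scene occupies a small part of CLIP space.
rng = np.random.default_rng(0)
true_latents = rng.normal(size=(200, LATENT_DIM))
mixing = rng.normal(size=(LATENT_DIM, FEATURE_DIM))
scene_features = true_latents @ mixing + 0.5

mean, basis = fit_linear_autoencoder(scene_features, LATENT_DIM)
codes = encode(scene_features, mean, basis)
reconstruction = decode(codes, mean, basis)

max_error = float(np.abs(reconstruction - scene_features).max())
memory_ratio = FEATURE_DIM / LATENT_DIM  # per-Gaussian storage saved by the latent
```

Under these assumptions each Gaussian would store 3 numbers instead of 512 for its language feature, and the decoder is applied only to rendered pixels at query time. Real CLIP features are not exactly low-rank, so a learned nonlinear autoencoder would trade some reconstruction error for this saving.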

Paper Structure (17 sections, 6 equations, 11 figures, 7 tables)

This paper contains 17 sections, 6 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Visualization of learned 3D language features of the previous SOTA method LERF and our LangSplat. While LERF generates imprecise and vague 3D features, our LangSplat accurately captures object boundaries and provides precise 3D language fields. While being effective, our LangSplat is also 199$\times$ faster than LERF at the resolution of 1440 $\times$ 1080.
  • Figure 2: The framework of our LangSplat. Our LangSplat leverages SAM to learn hierarchical semantics to address the point ambiguity issue. The segment masks are then sent to the CLIP image encoder to extract the corresponding CLIP embeddings. We learn an autoencoder with these obtained CLIP embeddings. Our 3D language Gaussians learn language features on the scene-specific latent space to reduce the memory cost. During querying, the rendered language embeddings are sent to the decoder to recover the features on the CLIP space.
  • Figure 3: Qualitative comparisons of open-vocabulary 3D object localization on the LERF dataset. The red points are the model predictions and the black dashed bounding boxes denote the annotations.
  • Figure 4: Qualitative comparisons of open-vocabulary 3D semantic segmentation on the LERF dataset.
  • Figure 5: Qualitative comparisons of different methods on the 3D-OVS dataset. We visualize the segmentation results in 2 scenes. We observe that our method gives the most accurate segmentation maps.
  • ...and 6 more figures
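The querying path described in the Figure 2 caption (rendered latent embeddings, then the decoder, then CLIP space) can be sketched as follows. This is a hedged illustration only: the linear decoder, the cosine-similarity scoring, the arg-max localization rule, and every name and dimension below are assumptions, not the authors' implementation.

```python
import numpy as np


def decode_latents(latent_map, decoder_weight, decoder_bias):
    """Map an (H, W, d) rendered latent image to (H, W, D) CLIP-space features.

    A single linear layer is used here as a hypothetical decoder.
    """
    return latent_map @ decoder_weight + decoder_bias


def relevancy_map(clip_map, text_embedding, eps=1e-8):
    """Cosine similarity between each pixel feature and a text embedding."""
    pixels = clip_map / (np.linalg.norm(clip_map, axis=-1, keepdims=True) + eps)
    text = text_embedding / (np.linalg.norm(text_embedding) + eps)
    return pixels @ text


def localize(relevancy):
    """Return the (row, col) of the most relevant pixel."""
    flat_index = np.argmax(relevancy)
    return tuple(int(i) for i in np.unravel_index(flat_index, relevancy.shape))


# Toy data: a 4x5 rendered latent image with 3 latent channels and an
# 8-dimensional stand-in for CLIP space (all sizes are illustrative).
rng = np.random.default_rng(0)
height, width, latent_dim, clip_dim = 4, 5, 3, 8
decoder_weight = rng.normal(size=(latent_dim, clip_dim))
decoder_bias = np.zeros(clip_dim)

latent_map = rng.normal(size=(height, width, latent_dim))
target_latent = np.array([3.0, 0.0, 0.0])
latent_map[2, 3] = target_latent  # plant the queried "object" at pixel (2, 3)

# Pretend the text query embeds exactly where the planted object decodes to.
text_embedding = target_latent @ decoder_weight + decoder_bias

clip_map = decode_latents(latent_map, decoder_weight, decoder_bias)
relevancy = relevancy_map(clip_map, text_embedding)
predicted_pixel = localize(relevancy)
```

In this toy setup the planted pixel has cosine similarity close to 1 with the query, so the arg-max picks it out. A point prediction of this kind is what a localization benchmark would compare against an annotated bounding box.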