LangSplatV2: High-dimensional 3D Language Gaussian Splatting with 450+ FPS

Wanhua Li, Yujie Zhao, Minghan Qin, Yang Liu, Yuanhao Cai, Chuang Gan, Hanspeter Pfister

TL;DR

LangSplatV2 tackles the bottleneck of real-time, high-dimensional 3D language querying by replacing the heavyweight MLP decoder with a 3D sparse coefficient field that uses a global codebook of $L$ basis vectors and sparse per-Gaussian coefficients of size $K$. This decouples rendering cost from feature dimensionality: high-dimensional ($D$) CLIP features are produced at the cost of ultra-low-dimensional splatting, reaching up to 384.6 FPS in open-vocabulary 3D querying, with a total per-query time of around 2.6 ms on a single A100 GPU. The approach yields superior or competitive query accuracy across LERF, 3D-OVS, and Mip-NeRF360, with ablations showing $L=64$ and $K=4$ as effective choices. Training cost increases because high-dimensional semantic fields must be constructed, but the method delivers practical real-time inference for language-grounded 3D scenes and substantial speedups over prior work.
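As a quick consistency check on the two speed figures quoted above (both are rounded, so this is only approximate), a per-query time of about 2.6 ms converts to a frame rate of

$$\text{FPS} \approx \frac{1000\ \text{ms/s}}{2.6\ \text{ms}} \approx 384.6,$$

which agrees with the reported open-vocabulary querying rate.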

Abstract

In this paper, we introduce LangSplatV2, which achieves high-dimensional feature splatting at 476.2 FPS and 3D open-vocabulary text querying at 384.6 FPS for high-resolution images, providing a 42$\times$ speedup and a 47$\times$ boost over LangSplat, respectively, along with improved query accuracy. LangSplat employs Gaussian Splatting to embed 2D CLIP language features into 3D, significantly enhancing speed and learning a precise 3D language field with SAM semantics. Such advancements in 3D language fields are crucial for applications that require language interaction within complex scenes. However, LangSplat does not yet achieve real-time inference performance (8.2 FPS), even with advanced A100 GPUs, severely limiting its broader application. In this paper, we first conduct a detailed time analysis of LangSplat, identifying the heavyweight decoder as the primary speed bottleneck. Our solution, LangSplatV2, assumes that each Gaussian acts as a sparse code within a global dictionary, leading to the learning of a 3D sparse coefficient field that entirely eliminates the need for a heavyweight decoder. By leveraging this sparsity, we further propose an efficient sparse coefficient splatting method with CUDA optimization, rendering high-dimensional feature maps at high quality while incurring only the time cost of splatting an ultra-low-dimensional feature. Our experimental results demonstrate that LangSplatV2 not only achieves better or competitive query accuracy but is also significantly faster. Code and demos are available at our project page: https://langsplat-v2.github.io.
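The claim that high-dimensional features can be rendered at the cost of low-dimensional splatting rests on the linearity of alpha-blending. The NumPy sketch below illustrates that idea for a single pixel; it is not the authors' CUDA implementation. Several things are assumptions made only for illustration: $D=512$, the random codebook, the normalization of the coefficients, reading "size $K$" as $K$ non-zero entries per Gaussian, and the helper names `blend_dense` and `blend_sparse`. Only $L=64$ and $K=4$ come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# N Gaussians along one pixel's ray, L codebook entries, K non-zeros, D feature dims.
# L=64 and K=4 follow the ablation values in the text; N and D=512 are assumed.
N, L, K, D = 1000, 64, 4, 512

# Global codebook shared by the whole scene (random here, learned in practice).
codebook = rng.standard_normal((L, D))

# Sparse coefficients: for each Gaussian, K codebook indices and K weights.
idx = np.stack([rng.choice(L, size=K, replace=False) for _ in range(N)])  # (N, K)
val = rng.random((N, K))
val /= val.sum(axis=1, keepdims=True)  # normalization is an assumption

# Standard alpha-blending weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j).
alpha = rng.uniform(0.0, 0.2, size=N)
transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
w = alpha * transmittance


def blend_dense(codebook, idx, val, w):
    """Reference path: expand every Gaussian to D dims, then blend."""
    feats = np.einsum("nk,nkd->nd", val, codebook[idx])  # (N, D)
    return w @ feats  # (D,)


def blend_sparse(codebook, idx, val, w):
    """Blend only the K non-zero coefficients, then apply the codebook once."""
    coeff = np.zeros(codebook.shape[0])      # blended L-dim coefficient vector
    np.add.at(coeff, idx, w[:, None] * val)  # touches N*K entries, not N*D
    return coeff @ codebook                  # single (L,) x (L, D) product


out_dense = blend_dense(codebook, idx, val, w)
out_sparse = blend_sparse(codebook, idx, val, w)
print(np.allclose(out_dense, out_sparse))
```

Both paths give the same pixel feature up to floating-point error. The sparse path's per-Gaussian work scales with $K$ rather than $D$, which is the sense in which rendering is decoupled from the feature dimension.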

Paper Structure

This paper contains 17 sections, 7 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Feature rendering time comparison with different GPUs. Note that the less advanced GPUs (RTX 3090 and RTX 4090) cannot accommodate the LangSplat model with feature dimensions of 1024 or higher due to running out of memory.
  • Figure 2: The framework of LangSplatV2. LangSplatV2 introduces a sparse coefficient for each Gaussian point and a shared global codebook for the entire scene.
  • Figure 3: Our efficient sparse coefficient splatting method accelerates alpha-blending by exploiting the sparsity of the learned coefficient field and skipping zero elements.
  • Figure 4: Qualitative comparisons of open-vocabulary 3D object localization on the LERF dataset. The red points are the model predictions and the black dashed bounding boxes denote the annotations. We observe that LangSplatV2 generates better results than LangSplat.
  • Figure 5: Qualitative comparisons of open-vocabulary 3D semantic segmentation on the LERF, Mip-NeRF360, and 3D-OVS datasets. LangSplatV2 generates better masks than LangSplat, demonstrating its effectiveness.
  • ...and 2 more figures