3D Vision-Language Gaussian Splatting

Qucheng Peng, Benjamin Planche, Zhongpai Gao, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Chen Chen, Ziyan Wu

TL;DR

This work tackles the imbalance between visual and language modalities in 3D scene understanding by introducing 3D vision-language Gaussian splatting. It adds a cross-modal rasterizer that fuses color and semantic features before rasterization and introduces a learnable per-Gaussian semantic indicator to better represent language information, especially for translucent or reflective objects. A camera-view blending regularization further stabilizes semantic learning across views, yielding state-of-the-art open-vocabulary semantic segmentation on benchmarks like LERF and 3D-OVS. The approach demonstrates robust semantic representation in complex scenes while maintaining competitive visual rendering quality, with practical implications for robotics, AR/VR, and autonomous systems.

Abstract

Recent advancements in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches have naively embedded semantic representations into 3D reconstruction methods without striking a balance between the visual and language modalities, which leads to unsatisfactory semantic rasterization of translucent or reflective objects, as well as over-fitting to the color modality. To alleviate these limitations, we propose a solution that adequately handles the distinct visual and semantic modalities, i.e., a 3D vision-language Gaussian splatting model for scene understanding that emphasizes representation learning of the language modality. We propose a novel cross-modal rasterizer that uses modality fusion along with a smoothed semantic indicator to enhance semantic rasterization. We also employ a camera-view blending technique to improve semantic consistency between existing and synthesized views, thereby effectively mitigating over-fitting. Extensive experiments demonstrate that our method achieves state-of-the-art performance in open-vocabulary semantic segmentation, surpassing existing methods by a significant margin.
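
The camera-view blending idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not spell out the blending rule, so everything below is an assumption: a mixup-style coefficient `lam` linearly blends two training images and their camera positions, and spherically interpolates their orientations (unit quaternions in w, x, y, z order). The names `slerp` and `blend_views` are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical interpolation between unit quaternions (t=0 -> q0, t=1 -> q1)."""
    q0 = np.asarray(q0, dtype=float)
    q1 = np.asarray(q1, dtype=float)
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to take the shorter arc.
        q1, dot = -q1, -dot
    if dot > 0.9995:
        # Nearly identical rotations: normalized linear interpolation is stable.
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def blend_views(img_a, img_b, pos_a, pos_b, quat_a, quat_b, lam):
    """Assumed mixup-style blend of two posed views; lam=1 reproduces view A."""
    image = lam * np.asarray(img_a, dtype=float) + (1.0 - lam) * np.asarray(img_b, dtype=float)
    position = lam * np.asarray(pos_a, dtype=float) + (1.0 - lam) * np.asarray(pos_b, dtype=float)
    rotation = slerp(quat_b, quat_a, lam)
    return image, position, rotation

# Toy data: two 2x2 RGB images and two camera poses (illustrative values only).
img_a = np.zeros((2, 2, 3))
img_b = np.ones((2, 2, 3))
pos_a = np.array([0.0, 0.0, 0.0])
pos_b = np.array([2.0, 0.0, 0.0])
quat_a = np.array([1.0, 0.0, 0.0, 0.0])                          # identity
quat_b = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])  # 90 degrees about z

image, position, rotation = blend_views(img_a, img_b, pos_a, pos_b, quat_a, quat_b, lam=0.5)
```

Under these assumptions, the blended image and pose would be fed through the model as an extra synthesized view, and a semantic consistency loss would compare its rendering against the blended target. How that loss is defined is not stated in this summary.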

Paper Structure

This paper contains 32 sections, 12 equations, 15 figures, 23 tables.

Figures (15)

  • Figure 1: Comparison of prior semantic 3DGS work and our novel method. We apply cross-modal rasterization and camera-view-based regularization for better exploration of semantic features.
  • Figure 2: Overview of our proposed framework. A) We propose a novel multi-modal Gaussian splatting model; B) we enrich the input images and poses for the model to better fit the semantic information. Besides our introduction of a novel semantic indicator parameter $l$, our additional contributions are: C) a semantic-aware cross-modal rasterization module; and D) a camera view blending augmentation scheme for training regularization.
  • Figure 3: Empirical differences between the color opacity and the proposed smoothed semantic indicator. On the left, we visualize the difference $l^i - o^i$ in Gaussians modeling the ramen scene. While their color opacity may vary significantly, most Gaussians need their semantic features to be rasterized with minimal blending (see the red regions in the difference maps, i.e., where $l^i \gg o^i$), except for Gaussians representing intangible lighting effects (see the glares on the bottle, bowl, table, etc.). On the right, we further plot the density distribution of $l^i - o^i$ in Gaussians for the ramen scene, indicating the different distributions of these two control parameters.
  • Figure 4: Qualitative semantic segmentation comparisons on the ramen scene (LERF dataset).
  • Figure 5: Qualitative semantic segmentation comparisons on the kitchen scene (LERF dataset).
  • ...and 10 more figures
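
The contrast drawn in the Figure 3 caption between the color opacity $o^i$ and the semantic indicator $l^i$ can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: it uses standard front-to-back alpha compositing along a single ray, omits the paper's modality fusion step and the 2D Gaussian footprint, and uses made-up values for a translucent "glass" Gaussian in front of an opaque "table" Gaussian. The function name `composite` is hypothetical.

```python
import numpy as np

def composite(values, weights):
    """Front-to-back compositing along one ray.

    values:  (N, D) per-Gaussian attributes, sorted near to far.
    weights: (N,) per-Gaussian blending weights in [0, 1].
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Transmittance reaching Gaussian i: product of (1 - w_j) over all j < i.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - weights)[:-1]))
    return (values * (weights * transmittance)[:, None]).sum(axis=0)

# Gaussian 0: glass (near, translucent). Gaussian 1: table (far, opaque).
colors = np.array([[0.9, 0.9, 1.0], [0.4, 0.2, 0.1]])
semantics = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy one-hot "glass" / "table" features
opacity = np.array([0.1, 1.0])                  # o^i: the glass barely affects color
indicator = np.array([0.95, 1.0])               # l^i: assumed value, glass is semantically solid

pixel_color = composite(colors, opacity)
sem_with_opacity = composite(semantics, opacity)      # semantics blended like color
sem_with_indicator = composite(semantics, indicator)  # semantics blended with l^i
```

In this toy ray, blending semantics with the color opacity assigns the glass only 10% of the weight, so the pixel would be labeled "table". Blending with the separate indicator assigns the glass 95%, while the rendered color is unchanged because it still uses $o^i$.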