SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

Yue Li, Qi Ma, Runyi Yang, Huapeng Li, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, Martin R. Oswald, Danda Pani Paudel

TL;DR

SceneSplat introduces an approach to open-vocabulary 3D scene understanding that operates directly on 3D Gaussian splats, paired with the large-scale SceneSplat-7K dataset and a self-supervised GaussSSL framework. The method uses vision-language pretraining to align per-Gaussian 3D features with language, enabling zero-shot segmentation without 2D fusion at inference and achieving state-of-the-art results across multiple indoor benchmarks. The work also demonstrates label-free pretraining and provides ablations validating design choices, while highlighting data quality and consistency considerations. Together, these contributions advance scalable, language-grounded 3D scene understanding and establish standardized benchmarks for 3DGS-based reasoning in indoor environments.

Abstract

Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training or together at inference. This highlights the clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable manner remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 7916 scenes derived from seven established datasets, such as ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 150 GPU days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed method over the established baselines.

Paper Structure

This paper contains 32 sections, 11 equations, 18 figures, 16 tables, 1 algorithm.

Figures (18)

  • Figure 1: We present the 3DGS indoor dataset SceneSplat-7K, which includes 7K scenes generated from ARKitScenes baruch2021arkitscenes, Replica straub2019replica, ScanNet dai2017scannet, ScanNet++ yeshwanth2023scannet++, Hypersim roberts2021hypersim, 3RScan wald2019rio, and Matterport3D chang2017matterport3d. Leveraging this high-quality dataset, we propose SceneSplat, the first model to predict open-vocabulary language features for millions of 3D Gaussians in a single forward pass.
  • Figure 2: SceneSplat Overview. The SceneSplat-7K dataset enables Vision-Language Pretraining and Self-Supervised Pretraining. For vision-language pretraining, we associate each 3D Gaussian primitive with semantic features based on our label collection process and train a generalizable open-vocabulary learner that predicts per-Gaussian embeddings. For self-supervised pretraining, we employ Masked Gaussian Modeling to reconstruct masked primitives, Self-Distillation Learning for augmentation-invariant features, and Language-Gaussian Alignment for scenes with collected labels. The former achieves state-of-the-art zero-shot segmentation results on the ScanNet200 dai2017scannet, ScanNet++ yeshwanth2023scannet++, and Matterport3D chang2017matterport3d benchmarks, and the latter unlocks training on large-scale 3DGS data.
  • Figure 3: Qualitative Results of Zero-Shot 3D Semantic Segmentation on ScanNet++. SceneSplat demonstrates competitive zero-shot performance. Note how our model correctly annotates regions lacking ground truth labels, e.g., the desks in the top row. Best viewed zoomed in and in color.
  • Figure 4: Text-Based 3DGS Scene Query. Given text queries and SceneSplat inference results for a 3DGS scene, we can effectively localize the corresponding splats (highlighted in red for queries "Robot Arm", "Box", and "Keyboard").
  • Figure 5: Distribution of SceneSplat Zero-Shot Semantic Segmentation mIoU w.r.t. Input 3DGS Scene PSNR. Reported on the Matterport3D test split labeled with 21 semantic classes, the box plot shows a clear positive trend between the training PSNR of the input 3DGS scene and the resulting mIoU after applying SceneSplat language pretraining for zero-shot semantic segmentation. This encourages careful curation of the collected 3DGS scene dataset.
  • ...and 13 more figures
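
The zero-shot segmentation and text-based query described in the captions for Figures 3 and 4 can be illustrated with a minimal sketch. It assumes that per-Gaussian language features and text embeddings already live in a shared embedding space and are compared by cosine similarity, a common choice for vision-language features that the captions do not confirm as SceneSplat's exact procedure. The function names (`zero_shot_labels`, `query_splats`), the threshold value, and the toy 2-D features are hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def zero_shot_labels(gaussian_feats, text_embeds):
    """Assign each Gaussian the class whose text embedding is most similar.

    gaussian_feats: (N, D) per-Gaussian features (assumed model output).
    text_embeds:    (C, D) one embedding per class name (assumed text encoder output).
    Returns (labels of shape (N,), cosine similarities of shape (N, C)).
    """
    sim = l2_normalize(gaussian_feats) @ l2_normalize(text_embeds).T
    return sim.argmax(axis=1), sim


def query_splats(gaussian_feats, query_embed, threshold=0.5):
    """Boolean mask of Gaussians whose similarity to one query meets the threshold.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    sim = l2_normalize(gaussian_feats) @ l2_normalize(query_embed)
    return sim >= threshold


# Toy data standing in for real features: three Gaussians, two classes.
feats = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.1]])
text = np.array([[1.0, 0.0], [0.0, 1.0]])

labels, sim = zero_shot_labels(feats, text)
mask = query_splats(feats, text[0], threshold=0.5)
```

In this sketch the per-class argmax gives a semantic segmentation over all Gaussians, while the thresholded mask corresponds to highlighting the splats that match a single text query.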