GSemSplat: Generalizable Semantic 3D Gaussian Splatting from Uncalibrated Image Pairs

Xingrui Wang, Cuiling Lan, Hanxin Zhu, Zhibo Chen, Yan Lu

TL;DR

This work introduces GSemSplat, a framework for generalizable 3D semantic fields that attaches open-vocabulary semantics to 3D Gaussians predicted from sparse, uncalibrated image pairs. It augments the Splatt3R backbone with a semantic head supervised by two kinds of CLIP-derived 2D features, region-specific and context-aware, so that inference is fast and feed-forward with no per-scene optimization. The method supports open-vocabulary querying in 3D without camera calibration or dense image collections. On ScanNet++ it outperforms the scene-specific baseline in semantic understanding while running orders of magnitude faster and keeping reasonable RGB quality. Ablation and generalization experiments indicate that combining the two feature types with a dedicated querying strategy yields more reliable semantics across scenes and datasets.

Abstract

Modeling and understanding the 3D world is crucial for various applications, from augmented reality to robotic navigation. Recent advancements based on 3D Gaussian Splatting have integrated semantic information from multi-view images into Gaussian primitives. However, these methods typically require costly per-scene optimization from dense calibrated images, limiting their practicality. In this paper, we consider the new task of generalizable 3D semantic field modeling from sparse, uncalibrated image pairs. Building upon the Splatt3R architecture, we introduce GSemSplat, a framework that learns open-vocabulary semantic representations linked to 3D Gaussians without the need for per-scene optimization, dense image collections or calibration. To ensure effective and reliable learning of semantic features in 3D space, we employ a dual-feature approach that leverages both region-specific and context-aware semantic features as supervision in the 2D space. This allows us to capitalize on their complementary strengths. Experimental results on the ScanNet++ dataset demonstrate the effectiveness and superiority of our approach compared to the traditional scene-specific method. We hope our work will inspire more research into generalizable 3D understanding.

Paper Structure

This paper contains 19 sections, 3 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: Comparison of methods for obtaining semantic 3D Gaussians. (a) Previous per-scene optimization-based methods require dense calibrated images and costly iterative optimization. (b) Our generalizable method allows fast feed-forward inference with sparse uncalibrated (i.e., pose-free) images as input.
  • Figure 2: Given an input image (a), we present the response map when we use the text query of "floor" to retrieve the correlated content from the region-specific feature in (b), and from the context-aware feature in (c). For the region-specific feature on the floor region, due to lack of context, the region presents a low correlation with the query. In contrast, the floor region presents a high correlation for context-aware features as shown in (c).
  • Figure 3: Illustration of our overall framework, GSemSplat, for generalizable open-vocabulary 3D scene understanding. It obviates the need for costly and complicated per-scene optimization and for tedious, extensive image collection and calibration. We use the generalizable 3D Gaussian Splatting architecture Splatt3R (Smart et al., 2024) as our base network, which predicts 3D Gaussians from uncalibrated image pairs. We introduce a new semantic head that predicts the low-dimensional semantics associated with each Gaussian. Without relying on costly human annotation of 3D semantics, we distill the 3D semantic information from the 2D semantic features. In particular, GSemSplat simultaneously predicts region-specific and context-aware semantic features, which lets a text query localize the more reliable semantics. The predicted low-dimensional semantic features are transformed into high-dimensional counterparts via MLP blocks for open-vocabulary semantic understanding.
  • Figure 4: Given an input image (a), we present the response map when we use the text query of "chair" to retrieve the correlated content from the region-specific feature in (b), and from the context-aware feature in (c). For the context-aware feature, the cabinet region mistakenly presents a high correlation with the query due to interference from the neighboring chair region. In contrast, the cabinet region presents a low correlation for the region-specific feature as shown in (b).
  • Figure 5: Open-vocabulary semantic segmentation results (the last two columns) of LangSplat are poor when it is trained/optimized with sparse views (two context views in the first two columns).
  • ...and 11 more figures
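
The response maps described in the captions of Figures 2 and 4 can be illustrated with a small sketch. It assumes that a response map is the per-pixel cosine similarity between a semantic feature map and a text-query embedding, which is a common choice for CLIP-style features but is not confirmed by the text above. The function name `response_map`, the array shapes, and the random toy data are hypothetical stand-ins, not the authors' implementation. The sketch also does not attempt to reproduce how GSemSplat combines the two maps at query time.

```python
import numpy as np


def response_map(features, text_embedding, eps=1e-8):
    """Per-pixel cosine similarity with a text query (assumed definition).

    features:       (H, W, D) array of per-pixel semantic features.
    text_embedding: (D,) array for the text query.
    Returns an (H, W) array with values in [-1, 1].
    """
    # Normalize every pixel feature and the query to unit length.
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + eps)
    t = text_embedding / (np.linalg.norm(text_embedding) + eps)
    # Dot product over the feature dimension gives the cosine similarity.
    return f @ t


# Toy data: random vectors stand in for real CLIP-derived features.
rng = np.random.default_rng(0)
H, W, D = 4, 6, 16
query = rng.normal(size=D)                  # hypothetical text embedding
region_feat = rng.normal(size=(H, W, D))    # stand-in, region-specific
context_feat = rng.normal(size=(H, W, D))   # stand-in, context-aware

# Plant the query direction at one pixel of the context-aware features,
# so that pixel responds strongly there but not in the other map.
context_feat[2, 3] = 5.0 * query

region_map = response_map(region_feat, query)
context_map = response_map(context_feat, query)

# Location of the strongest response in the context-aware map.
peak = np.unravel_index(context_map.argmax(), context_map.shape)
```

Computing one map per feature type makes cases like those in Figures 2 and 4 visible side by side: a pixel can respond strongly in one map and weakly in the other, which is why a querying strategy that draws on both is useful.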