FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding

Xingxing Zuo, Pouya Samangouei, Yunwen Zhou, Yan Di, Mingyang Li

TL;DR

FMGS presents a novel 3D scene representation that fuses Gaussian Splatting with a multi-resolution hash-encoded semantic field to embed foundation-model (CLIP/DINO) vision-language features directly into 3D space. The approach distills 2D foundation-model embeddings into a 3D field, supervised by a hybrid CLIP feature map and reinforced by a DINO-based regularization and a pixel-alignment loss to enforce cross-view consistency. It achieves state-of-the-art open-vocabulary object detection and competitive open-vocabulary segmentation while rendering CLIP features far faster than prior 3D-language methods (the abstract reports 851X faster inference). The work advances open-world 3D scene understanding for AR and robotics by enabling efficient, queryable semantic representations in uncontrolled environments.

Abstract

Precisely perceiving the geometric and semantic properties of real-world 3D objects is crucial for the continued evolution of augmented reality and robotic applications. To this end, we present Foundation Model Embedded Gaussian Splatting (FMGS), which incorporates vision-language embeddings of foundation models into 3D Gaussian Splatting (GS). The key contribution of this work is an efficient method to reconstruct and represent 3D vision-language models. This is achieved by distilling feature maps generated from image-based foundation models into those rendered from our 3D model. To ensure high-quality rendering and fast training, we introduce a novel scene representation by integrating strengths from both GS and multi-resolution hash encodings (MHE). Our effective training procedure also introduces a pixel alignment loss that keeps the rendered features of the same semantic entity close to one another, following the pixel-level semantic boundaries. Our results demonstrate remarkable multi-view semantic consistency, facilitating diverse downstream tasks. FMGS outperforms state-of-the-art methods by 10.2 percent on open-vocabulary language-based object detection while being 851X faster at inference. This research explores the intersection of vision, language, and 3D scene representation, paving the way for enhanced scene understanding in uncontrolled real-world environments. We plan to release the code on the project page.
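
The distillation step described in the abstract, in which feature maps rendered from the 3D model are trained to match those produced by an image-based foundation model, can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the function name `distillation_loss`, the `(H, W, C)` array layout, and the choice of cosine distance as the per-pixel penalty. The abstract does not state which loss the authors actually use.

```python
import numpy as np

def distillation_loss(rendered, target, eps=1e-8):
    """Mean per-pixel cosine distance between two feature maps.

    rendered: (H, W, C) feature map rendered from the 3D model.
    target:   (H, W, C) feature map from a 2D foundation model.
    Cosine distance is an illustrative choice, not the paper's stated loss.
    """
    # Normalize every pixel's feature vector to unit length.
    r = rendered / (np.linalg.norm(rendered, axis=-1, keepdims=True) + eps)
    t = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    # Cosine similarity per pixel, then average the distance over the image.
    cos = np.sum(r * t, axis=-1)
    return float(np.mean(1.0 - cos))
```

Under this sketch the loss is near zero when the rendered map matches the target and grows as the per-pixel feature directions diverge.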

Paper Structure

This paper contains 27 sections, 10 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: FMGS training pipeline. Left: FMGS's feature field renders CLIP and DINO feature maps for loss calculation. The feature field is a multi-resolution hash encoder (MHE) (Müller et al., 2022) that embeds semantic information into the 3D Gaussians obtained from 3D Gaussian Splatting (Kerbl et al., 2023). Right: The target DINO feature map and hybrid CLIP feature map from the foundation models. Note that, for visual simplicity, only a single-level MHE is shown here; the implementation uses multiple levels and concatenates their encodings.
  • Figure 2: Features extracted from foundation models. The three subfigures on the left show the RGB image, the DINO features extracted from the foundation model, and the hybrid CLIP feature, which is an average of the multi-scale CLIP feature maps shown on the right. The seven CLIP feature maps on the right are extracted from an image pyramid at multiple scales using the foundation model. The resolution of the CLIP features decreases from left to right.
  • Figure 3: FMGS query pipeline. Top: Given a query view in which to localize a query, FMGS first renders the dense CLIP feature map. Bottom: Given an open-vocabulary query, FMGS generates a relevancy map highlighting the parts of the rendered CLIP feature map that are relevant to the query embedding. The most relevant parts are colored red and the least relevant parts are colored blue. Note that, for visual simplicity, this figure shows a single-level MHE, whereas the implementation uses multiple levels.
  • Figure 4: Features for training and rendered views. Left: From left to right, the figures show the RGB image, the rendered DINO feature map, the raw DINO feature map extracted for training, the rendered CLIP feature map, and the raw CLIP feature map used for training. Right: We display the relevancy scores for the rendered and raw CLIP feature maps with the text query 'flower', where the color bar indicates relevancy scores normalized to the 0-255 range. Notably, querying the raw CLIP feature map gives much worse results than querying the rendered CLIP feature map.
  • Figure 5: Effect of dot product similarity (dotpsim) loss. From left to right: RGB image, rendered DINO feature without dotpsim, rendered DINO feature with dotpsim, rendered CLIP without dotpsim, and rendered CLIP feature map with dotpsim. The DINO feature maps do not have significant differences with or without dotpsim. From the CLIP feature maps, we can see that objects can be further distinguished from each other and the background. Differences are highlighted in the red boxes.
  • ...and 3 more figures
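
Two operations mentioned in the captions above lend themselves to a short sketch: the hybrid CLIP feature of Figure 2 (an average of multi-scale CLIP feature maps) and the relevancy map of Figures 3 and 4 (a per-pixel score against a query embedding, normalized to 0-255). The NumPy code below is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the nearest-neighbor resizing to a common resolution, the use of plain cosine similarity as the relevancy score, and the min-max normalization are all hypothetical choices; the paper's actual scoring rule may differ.

```python
import numpy as np

def hybrid_feature_map(feature_maps, out_hw):
    """Average multi-scale feature maps at a common resolution.

    feature_maps: list of (h_i, w_i, C) arrays with the same channel count C.
    out_hw:       (H, W) target resolution.
    Nearest-neighbor resizing is an assumption made to keep the sketch short.
    """
    H, W = out_hw
    resized = []
    for fmap in feature_maps:
        h, w, _ = fmap.shape
        # Map each output row/column to its nearest source index.
        rows = np.arange(H) * h // H
        cols = np.arange(W) * w // W
        resized.append(fmap[rows[:, None], cols[None, :]])
    return np.mean(resized, axis=0)

def relevancy_map(clip_map, query_embedding, eps=1e-8):
    """Per-pixel cosine similarity to a query, scaled to the 0-255 range.

    clip_map:        (H, W, C) dense CLIP feature map.
    query_embedding: (C,) embedding of the open-vocabulary query.
    Cosine similarity with min-max scaling is an illustrative scoring rule.
    """
    f = clip_map / (np.linalg.norm(clip_map, axis=-1, keepdims=True) + eps)
    q = query_embedding / (np.linalg.norm(query_embedding) + eps)
    sim = f @ q  # shape (H, W)
    lo, hi = sim.min(), sim.max()
    return np.round(255.0 * (sim - lo) / (hi - lo + eps)).astype(np.uint8)
```

In this sketch a query embedding would come from a CLIP text encoder and `clip_map` from the rendered feature field; both are stand-ins here, so any arrays with matching channel counts can be used to try it out.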