X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models

Yueen Ma, Irwin King

Abstract

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods are isolated, focusing on specific domains such as online SLAM, semantic enrichment, or 3DGS for unposed images. In this paper, we introduce X-GS, an extensible open framework that unifies a broad range of techniques to enable real-time 3DGS-based online SLAM enriched with semantics, bridging the gap to downstream multimodal models. At the core of X-GS is a highly efficient pipeline called X-GS-Perceiver, capable of taking unposed RGB (or optionally RGB-D) video streams as input to co-optimize geometry and poses, and distill high-dimensional semantic features from vision foundation models into the 3D Gaussians. We achieve real-time performance through a novel online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design. The semantic 3D Gaussians can then be utilized by vision-language models within the X-GS-Thinker component, enabling downstream tasks such as object detection, zero-shot caption generation, and potentially embodied tasks. Experimental results on real-world datasets showcase the efficacy, efficiency, and newly unlocked multimodal capabilities of the X-GS framework.

Paper Structure (29 sections, 41 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: X-GS is an extensible open framework that unifies previously isolated domains, including pose-free 3DGS, 3DGS-based SLAM, Semantic GS, and VLMs for 3DGS. X-GS-Perceiver achieves real-time 3DGS-based online SLAM with semantics using an online VQ module, GPU-accelerated grid-sampling, and highly parallelized scheduling. X-GS-Thinker bridges the resulting semantic 3DGS representation with downstream multimodal models to execute complex, language-driven tasks such as open-vocabulary 3D object detection, caption generation, and potentially embodied tasks.
  • Figure 2: Overview of the X-GS framework. X-GS-Perceiver synergizes a memory-efficient Vector Quantization (VQ) module, grid-based semantic supervision, and an asynchronous parallelized pipeline to perform SLAM and distill semantics simultaneously in an online fashion, operating in real time at $\sim$15 FPS. As an open framework, it accommodates both RGB-only and RGB-D inputs, and can flexibly integrate various Vision Foundation Models (VFMs). Furthermore, the X-GS-Thinker component is extensible to different multimodal models, enabling a wide range of downstream tasks.
  • Figure 3: Qualitative results of X-GS on scene reconstruction and semantic distillation. From left to right: Ground Truth (GT) RGB, Rendered RGB, GT Semantic Map (from VFMs, SAM + CLIP), Rendered Semantic Map, and an open-vocabulary Object Detection example.
  • Figure 4: Qualitative results of X-GS for 3D scene caption generation.
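
The Figure 2 caption describes the VQ module as memory-efficient and online, but this excerpt does not specify how it works. As a rough illustration of the general idea only, the sketch below shows a generic streaming vector quantizer: each incoming feature is mapped to its nearest codebook entry, so a Gaussian could store a small integer index instead of a high-dimensional feature vector. The class name `OnlineVQ`, the distance `threshold`, and the running-mean update are all assumptions made for this example and are not taken from the paper.

```python
import math


class OnlineVQ:
    """Toy streaming vector quantizer (illustrative; not the X-GS module)."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical cutoff for spawning a new code
        self.codebook = []          # code vectors, kept as running means
        self.counts = []            # number of features assigned to each code

    def assign(self, feature):
        """Return the code index for `feature`, updating the codebook in place."""
        best, best_dist = -1, math.inf
        for i, code in enumerate(self.codebook):
            d = math.dist(code, feature)  # Euclidean distance
            if d < best_dist:
                best, best_dist = i, d

        # No code is close enough: add the feature as a new code.
        if best == -1 or best_dist > self.threshold:
            self.codebook.append(list(feature))
            self.counts.append(1)
            return len(self.codebook) - 1

        # Otherwise move the matched code toward the feature (running mean).
        self.counts[best] += 1
        n = self.counts[best]
        self.codebook[best] = [
            c + (f - c) / n for c, f in zip(self.codebook[best], feature)
        ]
        return best


vq = OnlineVQ(threshold=0.5)
indices = [vq.assign(f) for f in ([0.0, 0.0], [0.2, 0.0], [5.0, 5.0])]
```

In this toy run the first two features share one code and the third, being far away, creates a second one. A real system would operate on batched GPU tensors and far higher-dimensional features (for example, CLIP embeddings).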