PanoSLAM: Panoptic 3D Scene Reconstruction via Gaussian SLAM

Runnan Chen, Zhaoqing Wang, Jiepeng Wang, Yuexin Ma, Mingming Gong, Wenping Wang, Tongliang Liu

TL;DR

PanoSLAM tackles the challenge of open-world 3D panoptic reconstruction from unlabeled RGB-D video by extending 3D Gaussian Splatting with semantic and instance rendering. It introduces a Spatial-Temporal Lifting (STL) module to distill and refine noisy 2D panoptic predictions into a coherent 3D map, enabling online operation. The approach achieves state-of-the-art performance on Replica and ScanNet++ in mapping, tracking, and panoptic segmentation while reconstructing open-world scenes without manual labels. This work demonstrates how vision foundation models can be effectively integrated into a fast, differentiable 3D representation to deliver robust, label-free panoptic 3D understanding for robotics and AR applications.

Abstract

Understanding geometric, semantic, and instance information in 3D scenes from sequential video data is essential for applications in robotics and augmented reality. However, existing Simultaneous Localization and Mapping (SLAM) methods generally focus on either geometric or semantic reconstruction. In this paper, we introduce PanoSLAM, the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework. Our approach builds upon 3D Gaussian Splatting, modified with several critical components to enable efficient rendering of depth, color, semantic, and instance information from arbitrary viewpoints. To achieve panoptic 3D scene reconstruction from sequential RGB-D videos, we propose an online Spatial-Temporal Lifting (STL) module that transfers 2D panoptic predictions from vision models into 3D Gaussian representations. This STL module addresses the challenges of label noise and inconsistencies in 2D predictions by refining the pseudo labels across multi-view inputs, creating a coherent 3D representation that enhances segmentation accuracy. Our experiments show that PanoSLAM outperforms recent semantic SLAM methods in both mapping and tracking accuracy. For the first time, it achieves panoptic 3D reconstruction of open-world environments directly from the RGB-D video. (https://github.com/runnanchen/PanoSLAM)
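
To make the rendering claim in the abstract concrete, here is a minimal sketch of the standard front-to-back alpha compositing used in 3D Gaussian Splatting, applied to an arbitrary per-Gaussian attribute vector. Treating semantic scores the same way as colour is an assumption made for illustration; the function name `composite` and the toy values are hypothetical and not taken from the PanoSLAM code base.

```python
from typing import List, Sequence


def composite(alphas: Sequence[float],
              values: Sequence[Sequence[float]]) -> List[float]:
    """Front-to-back alpha compositing along one pixel ray.

    alphas[i] is the opacity contribution of the i-th Gaussian at this pixel,
    sorted from near to far. values[i] is that Gaussian's attribute vector
    (for example RGB, depth, or, as assumed here, per-class semantic scores).
    """
    out = [0.0] * len(values[0])
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for alpha, value in zip(alphas, values):
        weight = alpha * transmittance
        for k, v in enumerate(value):
            out[k] += weight * v
        transmittance *= 1.0 - alpha
    return out


# Two Gaussians on one ray, each carrying scores for three hypothetical classes.
semantic = composite([0.5, 0.5], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
# The rendered label is the class with the largest composited score.
label = max(range(len(semantic)), key=semantic.__getitem__)
```

Because the same weights serve every attribute, depth, colour, semantic, and instance channels can in principle share one rasterisation pass. Whether PanoSLAM does so exactly this way is not stated in the abstract.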

Paper Structure (29 sections, 13 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: We present PanoSLAM, a SLAM system based on 3D Gaussian Splatting, capable of 3D geometric, semantic and instance reconstruction from unlabeled RGB-D videos.
  • Figure 2: Overview of the PanoSLAM framework for panoptic 3D scene reconstruction from unlabeled RGB-D videos. The system comprises four main components: (1) Camera Tracking: Estimating camera poses from input RGB-D videos to facilitate 3D mapping. (2) Panoptic Information Inference: Utilizing 2D vision models to generate multi-view 2D panoptic pseudo-labels, which are re-projected into 3D space for label refinement. (3) 3D Gaussians Updating: Constructing and updating a 3D Gaussian map for efficient rendering and mapping. (4) Spatial-Temporal Lifting: Refining 3D panoptic pseudo-labels through consistent optimization across multiple views to address label noise, enabling high-quality panoptic scene reconstruction.
  • Figure 3: Qualitative results of PanoSLAM in rendering appearance, panoptic information and depth. Four scenes from the Replica dataset are shown: room0, room1, office0 and office2. Note that different colours in the panoptic information indicate different instances.
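
The Figure 2 caption describes re-projecting multi-view 2D pseudo-labels into 3D and refining them across views. The sketch below illustrates that general idea with a much simpler stand-in: pinhole back-projection followed by per-voxel majority voting. It is not the authors' STL module, which refines labels through optimization over the Gaussian map. All names (`backproject`, `to_world`, `fuse_labels`), the voxel size, and the sample observations are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Sequence, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point:
    """Lift pixel (u, v) with metric depth to camera coordinates (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def to_world(p: Point, R: Sequence[Sequence[float]], t: Sequence[float]) -> Point:
    """Apply a camera-to-world pose: rotation R (3x3) and translation t."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def fuse_labels(observations: Iterable[Tuple[Point, str]],
                voxel_size: float = 0.05) -> Dict[Voxel, str]:
    """Majority-vote the 2D labels that land in the same 3D voxel."""
    votes: Dict[Voxel, Counter] = defaultdict(Counter)
    for point, label in observations:
        key = tuple(math.floor(c / voxel_size) for c in point)
        votes[key][label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


# Three views see roughly the same surface point; one view mislabels it.
observations = [
    ((1.02, 0.51, 2.03), "chair"),
    ((1.03, 0.52, 2.01), "chair"),
    ((1.01, 0.53, 2.04), "table"),  # noisy 2D prediction
    ((0.22, 0.31, 1.52), "floor"),
]
fused = fuse_labels(observations)

# Back-projecting the principal point with an identity pose stays on the optical axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
center = to_world(backproject(320.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0),
                  identity, [0.0, 0.0, 0.0])
```

In this toy run the single "table" vote is outvoted by the two consistent "chair" votes. That is the intuition behind using multi-view agreement to suppress label noise, although a hard vote discards the confidence information that an optimization-based refinement can exploit.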