PanopticSplatting: End-to-End Panoptic Gaussian Splatting

Yuxuan Xie, Xuan Yu, Changjian Jiang, Sitong Mao, Shunbo Zhou, Rui Fan, Rong Xiong, Yue Wang

TL;DR

PanopticSplatting tackles open-vocabulary 3D panoptic reconstruction by unifying semantic and instance segmentation within a Gaussian splatting framework. It introduces learnable instance queries and query-guided segmentation with local cross attention to enable end-to-end training and reasoning across views, while reducing memory costs. To handle noisy 2D pseudo masks, it employs label blending and a label warping loss to improve multi-view consistency and segmentation accuracy. The approach achieves state-of-the-art results on ScanNet-V2 and ScanNet++ against NeRF-based and Gaussian-based baselines and demonstrates robustness across different Gaussian base models.

Abstract

Open-vocabulary panoptic reconstruction is a challenging task that requires simultaneous scene reconstruction and understanding. Recently, methods have been proposed for 3D scene understanding based on Gaussian splatting. However, these methods are multi-staged and suffer from accumulated errors and a dependence on hand-designed components. To streamline the pipeline and achieve global optimization, we propose PanopticSplatting, an end-to-end system for open-vocabulary panoptic reconstruction. Our method introduces query-guided Gaussian segmentation with local cross attention, lifting 2D instance masks without cross-frame association in an end-to-end way. The local cross attention within the view frustum effectively reduces training memory, making our model more accessible to large scenes with more Gaussians and objects. In addition, to address the challenge of noisy labels in 2D pseudo masks, we propose label blending to promote consistent 3D segmentation with fewer noisy floaters, as well as label warping on 2D predictions, which enhances multi-view coherence and segmentation accuracy. Our method demonstrates strong performance in 3D scene panoptic reconstruction on the ScanNet-V2 and ScanNet++ datasets, compared with both NeRF-based and Gaussian-based panoptic reconstruction methods. Moreover, PanopticSplatting can be easily generalized to numerous variants of Gaussian splatting, and we demonstrate its robustness on different Gaussian base models.
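
To make the memory argument for local cross attention concrete, here is a minimal sketch of restricting query-to-Gaussian scoring to the Gaussians inside the current view frustum. It is an illustration under stated assumptions, not the paper's architecture: the function names (`in_frustum`, `local_query_scores`), the symmetric frustum test on camera-space centers, and the scaled dot product followed by a softmax over queries are all hypothetical choices made for this example.

```python
import math

def in_frustum(p, near=0.1, far=10.0, tan_half_fov=1.0):
    """True if camera-space point p = (x, y, z) lies inside a symmetric frustum.

    Assumption: centers are already expressed in camera coordinates.
    """
    x, y, z = p
    if not (near <= z <= far):
        return False
    return abs(x) <= tan_half_fov * z and abs(y) <= tan_half_fov * z

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_query_scores(centers, features, queries):
    """Score only in-frustum Gaussians against every instance query.

    Returns {gaussian_index: [probability per query]}. Gaussians outside
    the frustum are skipped, so no attention scores are computed for them.
    """
    dim = len(queries[0])
    out = {}
    for i, (c, f) in enumerate(zip(centers, features)):
        if not in_frustum(c):
            continue
        logits = [sum(a * b for a, b in zip(f, q)) / math.sqrt(dim) for q in queries]
        out[i] = softmax(logits)
    return out

# Toy scene: four Gaussians, two of which fall outside the frustum.
centers = [(0.0, 0.0, 2.0), (5.0, 0.0, 1.0), (0.2, -0.1, 3.0), (0.0, 0.0, -1.0)]
features = [(1.0, 0.0), (0.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
queries = [(4.0, 0.0), (0.0, 4.0)]
scores = local_query_scores(centers, features, queries)
```

In this toy scene only Gaussians 0 and 2 receive scores, so the attention cost scales with the number of visible Gaussians rather than with the whole scene.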

Paper Structure

This paper contains 14 sections, 14 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Existing methods for 3D panoptic reconstruction based on Gaussian splatting have multi-staged procedures. They can be broadly divided into two categories: label-lifting methods decompose the task into 2D mask alignment and label lifting; feature-lifting methods require hand-designed post-processing after feature distillation. Our method introduces instance queries to guide end-to-end panoptic reconstruction.
  • Figure 2: PanopticSplatting adds a semantic feature and an instance feature to each Gaussian to build semantic and instance feature fields. In the instance branch, Gaussian-modulated instance queries guide Gaussian segmentation through local cross attention. The semantic labels of Gaussians are generated by a simple semantic decoder. The labels of Gaussians are then rendered to 2D simultaneously. To achieve end-to-end training, linear assignment is performed between 2D pseudo instance masks and predicted labels. We employ the label warping loss on rendered semantic masks.
  • Figure 3: The pipelines of label blending and feature blending.
  • Figure 4: Comparison of the quality of panoptic segmentation and semantic segmentation of NeRF-based methods on ScanNet-V2 and ScanNet++.
  • Figure 5: Comparison of the quality of panoptic segmentation and semantic segmentation of Gaussian-based methods on ScanNet-V2 and ScanNet++.
  • ...and 1 more figure
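
The linear assignment mentioned in the Figure 2 caption can be illustrated with a small sketch: each 2D pseudo instance mask is matched to a distinct predicted mask so that the total overlap is maximal. Several things here are assumptions for illustration only: masks are represented as sets of pixel indices, the matching score is IoU (the cost actually used in the paper is not stated in this summary), and the helper names `iou` and `match_masks` are hypothetical. The search is exhaustive, which is only practical for a handful of masks; real pipelines would typically use a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def match_masks(pred_masks, pseudo_masks):
    """Optimal one-to-one matching of pseudo masks to predicted masks.

    Returns (assignment, total_iou), where assignment[j] is the index of
    the prediction matched to pseudo mask j. Exhaustive search over all
    injective assignments; requires len(pred_masks) >= len(pseudo_masks).
    """
    if len(pred_masks) < len(pseudo_masks):
        raise ValueError("need at least as many predictions as pseudo masks")
    best, best_score = (), -1.0
    for perm in permutations(range(len(pred_masks)), len(pseudo_masks)):
        score = sum(iou(pred_masks[p], pseudo_masks[j]) for j, p in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return list(best), best_score

# Three predicted masks (one per query) and two pseudo masks for one frame.
pred = [{0, 1, 2}, {10, 11}, {5, 6, 7, 8}]
pseudo = [{5, 6, 7}, {0, 1}]
assignment, total = match_masks(pred, pseudo)
```

Because the matching is recomputed per frame against the predictions, the pseudo mask IDs do not need to be consistent across frames, which is the property the abstract describes as lifting 2D instance masks without cross-frame association.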