PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness

Anh-Quan Cao, Angela Dai, Raoul de Charette

TL;DR

PaSCo addresses Panoptic Scene Completion (PSC), which extends Semantic Scene Completion (SSC) by predicting geometry, semantics, and instance IDs from sparse 3D inputs while providing calibrated uncertainty. It introduces a MIMO-inspired, single-pass ensemble built on a sparse CNN–Transformer backbone with a multiscale generator and a mask-based decoder. The decoder operates only on non-empty voxels for efficiency and outputs a set of masks $(m_k,c_k)$ for $k=1\dots K$. The final PSC output is obtained by permutation-invariant mask ensembling with Hungarian matching, which enables both voxel- and instance-level uncertainty estimation. The method demonstrates state-of-the-art PSC and uncertainty performance on three urban LiDAR datasets and shows robustness to distribution shifts; code and data are publicly released. PaSCo’s combination of multiscale geometric guidance, mask-based predictions, and uncertainty-aware MIMO ensembling represents a practical step toward reliable, complete 3D scene understanding for robotics and autonomous driving.
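To make the permutation problem concrete, here is a minimal sketch of aligning mask sets before averaging them. It is not PaSCo's implementation: the function names (`soft_iou`, `match_masks`, `ensemble_masks`), the soft-IoU matching score, and the choice of the first subnet as reference are all assumptions made for illustration. The optimal assignment is found by brute force over permutations, which returns the same result Hungarian matching would for a tiny $K$ but does not scale; the class scores $c_k$ are omitted and could be averaged the same way once masks are aligned.

```python
from itertools import permutations


def soft_iou(a, b):
    """Soft IoU between two masks given as lists of per-voxel probabilities."""
    inter = sum(min(x, y) for x, y in zip(a, b))
    union = sum(max(x, y) for x, y in zip(a, b))
    return inter / union if union > 0 else 0.0


def match_masks(ref, other):
    """Return perm so that other[perm[k]] is matched to ref[k].

    Brute-force stand-in for Hungarian matching (assumption: K is tiny).
    The score being maximized is the total soft IoU of matched pairs.
    """
    k = len(ref)
    best = max(
        permutations(range(k)),
        key=lambda p: sum(soft_iou(ref[i], other[p[i]]) for i in range(k)),
    )
    return list(best)


def ensemble_masks(predictions):
    """Average M mask sets after aligning each one to the first set."""
    ref = predictions[0]
    aligned = [ref]
    for pred in predictions[1:]:
        perm = match_masks(ref, pred)
        aligned.append([pred[j] for j in perm])
    m = len(aligned)
    n_vox = len(ref[0])
    return [
        [sum(s[k][v] for s in aligned) / m for v in range(n_vox)]
        for k in range(len(ref))
    ]


# Two hypothetical subnets predict the same two instances over 4 voxels,
# but list them in a different order.
subnet_a = [[0.9, 0.8, 0.1, 0.0], [0.1, 0.2, 0.9, 1.0]]
subnet_b = [[0.0, 0.2, 0.7, 1.0], [1.0, 0.6, 0.1, 0.0]]

perm = match_masks(subnet_a, subnet_b)
merged = ensemble_masks([subnet_a, subnet_b])
print(perm)
print(merged)
```

Averaging `subnet_a` and `subnet_b` index by index would blend two different instances; matching first is what makes the ensemble independent of the order in which each subnet emits its masks.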

Abstract

We propose the task of Panoptic Scene Completion (PSC), which extends the recently popular Semantic Scene Completion (SSC) task with instance-level information to produce a richer understanding of the 3D scene. Our PSC proposal utilizes a hybrid mask-based technique on the non-empty voxels from sparse multi-scale completions. Whereas the SSC literature overlooks uncertainty, which is critical for robotics applications, we instead propose an efficient ensembling to estimate both voxel-wise and instance-wise uncertainties alongside PSC. This is achieved by building on a multi-input multi-output (MIMO) strategy, while improving performance and yielding better uncertainty for little additional compute. Additionally, we introduce a technique to aggregate permutation-invariant mask predictions. Our experiments demonstrate that our method surpasses all baselines in both Panoptic Scene Completion and uncertainty estimation on three large-scale autonomous driving datasets. Our code and data are available at https://astra-vision.github.io/PaSCo.
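As a rough illustration of how an ensemble can yield a voxel-wise uncertainty, the sketch below averages the class distributions predicted by several subnets for one voxel and reports the normalized entropy of that average. This is a common ensemble recipe and an assumption here: the abstract does not state which uncertainty measure PaSCo uses, and the function name `voxel_uncertainty` and the toy inputs are hypothetical.

```python
import math


def voxel_uncertainty(subnet_probs):
    """Mean class distribution and normalized entropy for one voxel.

    subnet_probs: list of M class distributions (one per subnet), each a
    list of probabilities over at least two classes.
    Returns (mean distribution, entropy scaled to [0, 1]).
    """
    m = len(subnet_probs)
    n_cls = len(subnet_probs[0])
    mean = [sum(p[c] for p in subnet_probs) / m for c in range(n_cls)]
    # Terms with zero probability contribute nothing to the entropy.
    entropy = -sum(q * math.log(q) for q in mean if q > 0)
    return mean, entropy / math.log(n_cls)


# Hypothetical two-class voxel seen by two subnets.
agree = [[1.0, 0.0], [1.0, 0.0]]      # subnets agree: low uncertainty
disagree = [[1.0, 0.0], [0.0, 1.0]]   # subnets disagree: high uncertainty

_, u_agree = voxel_uncertainty(agree)
mean_disagree, u_disagree = voxel_uncertainty(disagree)
print(u_agree, u_disagree)
```

The same idea carries over to instance-wise uncertainty once masks from different subnets have been matched, since matched masks can then be treated as repeated predictions of one instance.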

Paper Structure (29 sections, 5 equations, 11 figures, 9 tables)

This paper contains 29 sections, 5 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: PaSCo output. Our method infers Panoptic Scene Completion (PSC) from a sparse input point cloud while concurrently assessing uncertainty at both the voxel and instance levels.
  • Figure 2: PaSCo overview. Our method aims to predict multiple variations of Panoptic Scene Completion (PSC) given an incomplete 3D point cloud, while allowing uncertainty estimation through mask ensembling. For PSC we employ a sparse 3D generative U-Net with a transformer decoder (\ref{sec:met_psc}). The uncertainty awareness is enabled using multiple subnets each operating on a different augmented version of an input data source (\ref{sec:met_uncertainty}). PaSCo allows the first Panoptic Scene Completion while providing a robust method for uncertainty estimation. Instance-wise uncertainty shows only "things" classes for clarity.
  • Figure 3: Architecture for PSC. Our architecture builds on a sparse generative U-Net coupled with a transformer decoder applied on pruned non-empty voxels to predict PSC.
  • Figure 4: Qualitative Panoptic Scene Completion. We report PSC outputs for all baselines of \ref{tab:psc_quantitative}. PaSCo shows better instance separation, with stronger instance shapes and scene structure, with fewer holes.
  • Figure 5: Qualitative uncertainty comparison on SSCBench-KITTI360 and SemanticKITTI. Note that "ins. unc." only shows examples from the "things" classes for clearer visualization. PaSCo($M{=}1$) tends towards overconfidence in both voxel and ins. unc. In contrast, PaSCo gives more intuitive uncertainty estimates, e.g., at segment boundaries, in areas with hallucinated scenery, and in regions with low input point density.
  • ...and 6 more figures