PanoSSC: Exploring Monocular Panoptic 3D Scene Reconstruction for Autonomous Driving

Yining Shi, Jiusi Li, Kun Jiang, Ke Wang, Yunlong Wang, Mengmeng Yang, Diange Yang

TL;DR

PanoSSC tackles outdoor monocular panoptic 3D scene reconstruction by predicting voxel-level occupancy, semantics, and instance IDs from a single RGB image. It introduces a two-head architecture with semantic occupancy and 3D instance completion, connected by a transformer-based 3D mask decoder and a mask-wise merging strategy to produce coherent panoptic voxel outputs. A two-stage training regime enables mutual improvement between tasks, and the method achieves competitive semantic scene completion while delivering notable gains in panoptic 3D reconstruction on SemanticKITTI. This approach advances vision-only, open-world scene understanding for autonomous driving by enabling instance-aware occupancy in 3D space.

Abstract

Vision-centric occupancy networks, which represent the surrounding environment as uniform voxels with semantic labels, have become a new trend for safe camera-only autonomous driving perception, as they can detect obstacles regardless of their shape and occlusion. Modern occupancy networks mainly focus on reconstructing visible voxels from object surfaces with voxel-wise semantic prediction. They usually suffer from inconsistent predictions within a single object and mixed predictions for adjacent objects. This confusion may harm the safety of downstream planning modules. To this end, we investigate panoptic segmentation on 3D voxel scenarios and propose an instance-aware occupancy network, PanoSSC. We predict foreground objects and backgrounds separately and merge both in post-processing. For foreground instance grouping, we propose a novel 3D instance mask decoder that can efficiently extract individual objects. We unify geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation into the PanoSSC framework and propose new metrics for evaluating panoptic voxels. Extensive experiments show that our method achieves competitive results on the SemanticKITTI semantic scene completion benchmark.
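
The abstract's "predict foreground objects and backgrounds separately and merge both in post-processing" step can be illustrated with a small sketch. This is not the authors' implementation: the function name `merge_panoptic`, the threshold `mask_thresh`, the `min_voxels` filter, and the score-ordered greedy assignment are all assumptions chosen for illustration.

```python
import numpy as np

def merge_panoptic(semantic, inst_masks, inst_classes, inst_scores,
                   mask_thresh=0.5, min_voxels=1):
    """Hypothetical merge of a semantic voxel grid with instance masks.

    semantic:     (X, Y, Z) int array, class per voxel (0 = empty).
    inst_masks:   (N, X, Y, Z) float array, per-instance mask probabilities.
    inst_classes: (N,) int array, foreground class of each instance.
    inst_scores:  (N,) float array, confidence of each instance.
    Returns (merged_semantic, instance_ids); instance id 0 marks
    background and empty voxels.
    """
    out_sem = semantic.copy()
    out_ids = np.zeros(semantic.shape, dtype=np.int32)
    next_id = 1
    # Assumption: higher-scoring instances claim their voxels first.
    for n in np.argsort(-inst_scores):
        free = out_ids == 0
        claim = (inst_masks[n] >= mask_thresh) & free
        if claim.sum() < min_voxels:
            continue  # drop instances with too few unclaimed voxels
        out_sem[claim] = inst_classes[n]  # instance class overrides semantics
        out_ids[claim] = next_id
        next_id += 1
    return out_sem, out_ids
```

In this sketch, voxels not claimed by any instance keep the label from the semantic grid and receive instance id 0, so each voxel ends up with exactly one class and at most one instance.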

Paper Structure (15 sections, 7 equations, 7 figures, 7 tables, 1 algorithm)

This paper contains 15 sections, 7 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Panoptic 3D scene reconstruction from a monocular RGB image for outdoor scenes with PanoSSC. Our method infers voxel-level occupancy, semantics and instance ids.
  • Figure 2: PanoSSC framework. We adopt a 2D UNet to generate multi-scale image features and lift them to 3D space with TPVFormer. After broadcasting the TPV features, the voxel features are used for 3D semantic occupancy prediction and instance completion, respectively. During inference, we adopt a mask-wise strategy to merge the results of the two prediction heads.
  • Figure 3: 3D mask decoder. We input the voxel features from TPVFormer and the initialized thing queries into the transformer-based 3D mask decoder, which generates 3D instance masks from attention maps and probabilities over all foreground categories from the refined queries.
  • Figure 4: Visualization on the SemanticKITTI validation set. Each pair of rows shows the results of semantic scene completion (upper) and 3D instance completion for vehicles (lower). The color bars denote the different categories in the SSC task, while colors in the 3D instance completion results indicate different instances. The darker voxels are outside the FOV of the image. Compared to MonoScene, our PanoSSC better captures the road layout (row 7) and estimates the shape of vehicles (rows 1-6), especially when they are close. It also better distinguishes similar categories, e.g. car and truck (rows 1-4).
  • Figure 5: Instance ids obtained from Euclidean clustering on the ground truth for SSC.
  • ...and 2 more figures
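
The caption of Figure 5 mentions instance ids obtained from Euclidean clustering on the SSC ground truth. A minimal sketch of that idea on a voxel grid follows, assuming that same-class voxels whose centers lie within a distance `tol` (in voxel units) belong to one instance. The function name `euclidean_cluster_ids`, the parameter `tol`, and its default value are hypothetical and not taken from the paper.

```python
from collections import deque
from itertools import product

import numpy as np

def euclidean_cluster_ids(semantic, thing_classes, tol=1.8):
    """Group same-class voxels into instances by Euclidean distance.

    semantic:      (X, Y, Z) int array of class labels.
    thing_classes: iterable of foreground class ids to cluster.
    tol:           link two voxels if their centers are at most `tol`
                   voxels apart (1.8 covers all 26 grid neighbors).
    Returns an (X, Y, Z) int32 array of instance ids (0 = no instance).
    """
    r = int(np.floor(tol))
    # All non-zero integer offsets within the distance tolerance.
    offsets = [o for o in product(range(-r, r + 1), repeat=3)
               if any(o) and np.linalg.norm(o) <= tol]
    ids = np.zeros(semantic.shape, dtype=np.int32)
    next_id = 1
    for cls in thing_classes:
        for seed in zip(*np.nonzero(semantic == cls)):
            if ids[seed]:
                continue  # already assigned to an earlier cluster
            ids[seed] = next_id
            queue = deque([seed])
            while queue:  # breadth-first region growing
                x, y, z = queue.popleft()
                for dx, dy, dz in offsets:
                    nb = (x + dx, y + dy, z + dz)
                    inside = all(0 <= c < s
                                 for c, s in zip(nb, semantic.shape))
                    if inside and semantic[nb] == cls and ids[nb] == 0:
                        ids[nb] = next_id
                        queue.append(nb)
            next_id += 1
    return ids
```

A larger `tol` bridges small gaps between voxels of the same class, at the risk of fusing objects that are close together.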