PanoSSC: Exploring Monocular Panoptic 3D Scene Reconstruction for Autonomous Driving
Yining Shi, Jiusi Li, Kun Jiang, Ke Wang, Yunlong Wang, Mengmeng Yang, Diange Yang
TL;DR
PanoSSC tackles outdoor monocular panoptic 3D scene reconstruction by predicting voxel-level occupancy, semantics, and instance IDs from a single RGB image. It introduces a two-head architecture with semantic occupancy and 3D instance completion, connected by a transformer-based 3D mask decoder and a mask-wise merging strategy to produce coherent panoptic voxel outputs. A two-stage training regime enables mutual improvement between tasks, and the method achieves competitive semantic scene completion while delivering notable gains in panoptic 3D reconstruction on SemanticKITTI. This approach advances vision-only, open-world scene understanding for autonomous driving by enabling instance-aware occupancy in 3D space.
Abstract
Vision-centric occupancy networks, which represent the surrounding environment as uniform voxels with semantic labels, have become a new trend in camera-only autonomous driving perception, as they can detect obstacles regardless of shape and occlusion. Modern occupancy networks mainly focus on reconstructing visible voxels from object surfaces with voxel-wise semantic prediction. They usually suffer from inconsistent predictions within a single object and mixed predictions across adjacent objects. Such confusion may harm the safety of downstream planning modules. To this end, we investigate panoptic segmentation in 3D voxel scenes and propose an instance-aware occupancy network, PanoSSC. We predict foreground objects and background separately and merge both in post-processing. For foreground instance grouping, we propose a novel 3D instance mask decoder that can efficiently extract individual objects. We unify geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation into the PanoSSC framework and propose new metrics for evaluating panoptic voxels. Extensive experiments show that our method achieves competitive results on the SemanticKITTI semantic scene completion benchmark.
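The post-processing step that merges separately predicted foreground instances with the background can be illustrated with a small sketch. This is not the authors' implementation: the function name `merge_masks`, the thresholds, the class IDs, and the greedy score-ordered rule are all assumptions chosen to show one plausible way a mask-wise merge might turn per-instance voxel masks and a semantic grid into panoptic voxels.

```python
import numpy as np

def merge_masks(semantic, mask_probs, mask_classes, mask_scores,
                prob_thresh=0.5, score_thresh=0.3, keep_ratio=0.5):
    """Hypothetical greedy mask-wise merge (illustrative only).

    semantic:     (X, Y, Z) int array of background/semantic labels.
    mask_probs:   (N, X, Y, Z) float array, per-instance voxel probabilities.
    mask_classes: (N,) int array, class ID of each instance mask.
    mask_scores:  (N,) float array, confidence of each instance mask.
    Returns (semantic_out, instance_out); instance ID 0 means "no instance".
    """
    sem_out = semantic.copy()
    inst_out = np.zeros(semantic.shape, dtype=np.int32)
    next_id = 1
    # Visit masks from most to least confident so strong masks claim voxels first.
    for idx in np.argsort(-mask_scores):
        if mask_scores[idx] < score_thresh:
            continue
        mask = mask_probs[idx] > prob_thresh
        area = mask.sum()
        if area == 0:
            continue
        # Only voxels not already claimed by a higher-scoring instance are free.
        free = mask & (inst_out == 0)
        # Drop masks that are mostly covered by earlier instances.
        if free.sum() / area < keep_ratio:
            continue
        sem_out[free] = mask_classes[idx]
        inst_out[free] = next_id
        next_id += 1
    return sem_out, inst_out

# Toy scene: a 4 x 4 x 2 grid, label 0 = empty, 9 = a made-up "road" class.
semantic = np.zeros((4, 4, 2), dtype=np.int32)
semantic[:, :, 0] = 9
mask_probs = np.zeros((2, 4, 4, 2))
mask_probs[0, 0:2, 0:2, 1] = 0.9   # first object
mask_probs[1, 1:3, 1:3, 1] = 0.8   # second object, overlapping by one voxel
mask_classes = np.array([1, 1])    # both use a made-up "car" class ID
mask_scores = np.array([0.9, 0.6])

sem, inst = merge_masks(semantic, mask_probs, mask_classes, mask_scores)
```

In this toy scene the two overlapping masks end up as separate instance IDs, with the contested voxel assigned to the higher-scoring mask, which is the kind of separation between adjacent objects that the abstract motivates.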
