PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation
Yuqi Wang, Yuntao Chen, Xingyu Liao, Lue Fan, Zhaoxiang Zhang
TL;DR
This work introduces PanoOcc, a camera-only approach for 3D panoptic segmentation that unifies object-level and semantic occupancy in a single 3D voxel representation. It leverages learnable 3D voxel queries, a coarse-to-fine occupancy decoder, and temporal fusion across multiple frames, combined with a multi-task training scheme for detection and segmentation. Thorough ablations demonstrate the importance of height-aware voxel queries, 3D voxel-based representations, and temporal information, achieving state-of-the-art results on nuScenes camera-based semantic and panoptic segmentation and strong occupancy predictions on Occ3D. The method also emphasizes memory efficiency through occupancy sparsification, making dense 3D scene understanding more practical for autonomous driving and paving the way for end-to-end holistic 3D perception from monocular video.
Abstract
Comprehensive modeling of the surrounding 3D world is key to the success of autonomous driving. However, existing perception tasks like object detection, road structure segmentation, depth & elevation estimation, and open-set object localization each only focus on a small facet of the holistic 3D scene understanding task. This divide-and-conquer strategy simplifies the algorithm development procedure at the cost of losing an end-to-end unified solution to the problem. In this work, we address this limitation by studying camera-based 3D panoptic segmentation, aiming to achieve a unified occupancy representation for camera-only 3D scene understanding. To achieve this, we introduce a novel method called PanoOcc, which utilizes voxel queries to aggregate spatiotemporal information from multi-frame and multi-view images in a coarse-to-fine scheme, integrating feature learning and scene representation into a unified occupancy representation. We have conducted extensive ablation studies to verify the effectiveness and efficiency of the proposed method. Our approach achieves new state-of-the-art results for camera-based semantic segmentation and panoptic segmentation on the nuScenes dataset. Furthermore, our method can be easily extended to dense occupancy prediction and has shown promising performance on the Occ3D benchmark. The code will be released at https://github.com/Robertwyq/PanoOcc.
