PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation

Yuqi Wang, Yuntao Chen, Xingyu Liao, Lue Fan, Zhaoxiang Zhang

TL;DR

This work introduces PanoOcc, a camera-only approach for 3D panoptic segmentation that unifies object-level and semantic occupancy in a single 3D voxel representation. It uses learnable 3D voxel queries, a coarse-to-fine occupancy decoder, and temporal fusion across multiple frames, trained jointly for detection and segmentation. Ablations indicate that height-aware voxel queries, a 3D voxel-based representation, and temporal information each contribute to performance. The method achieves state-of-the-art results among camera-based methods for semantic and panoptic segmentation on nuScenes and promising occupancy prediction on Occ3D. An optional occupancy sparsification step improves memory efficiency, making dense 3D scene understanding more practical for autonomous driving and moving toward end-to-end holistic 3D perception from multi-view, multi-frame camera input.

Abstract

Comprehensive modeling of the surrounding 3D world is key to the success of autonomous driving. However, existing perception tasks like object detection, road structure segmentation, depth & elevation estimation, and open-set object localization each only focus on a small facet of the holistic 3D scene understanding task. This divide-and-conquer strategy simplifies the algorithm development procedure at the cost of losing an end-to-end unified solution to the problem. In this work, we address this limitation by studying camera-based 3D panoptic segmentation, aiming to achieve a unified occupancy representation for camera-only 3D scene understanding. To achieve this, we introduce a novel method called PanoOcc, which utilizes voxel queries to aggregate spatiotemporal information from multi-frame and multi-view images in a coarse-to-fine scheme, integrating feature learning and scene representation into a unified occupancy representation. We have conducted extensive ablation studies to verify the effectiveness and efficiency of the proposed method. Our approach achieves new state-of-the-art results for camera-based semantic segmentation and panoptic segmentation on the nuScenes dataset. Furthermore, our method can be easily extended to dense occupancy prediction and has shown promising performance on the Occ3D benchmark. The code will be released at https://github.com/Robertwyq/PanoOcc.

Paper Structure

This paper contains 17 sections, 10 equations, 5 figures, 14 tables.

Figures (5)

  • Figure 1: Comparison of different tasks for 3D scene understanding. (a) LiDAR panoptic segmentation: Given sparse LiDAR points as input, the model outputs panoptic prediction on sparse LiDAR points. (b) Camera Detection and Segmentation: Given multi-view images, separate models are used to detect objects and perform BEV semantic segmentation. (c) Camera panoptic segmentation: Given multi-view images, a single model is trained to output dense panoptic occupancy predictions.
  • Figure 2: The overall framework of PanoOcc. We employ an image backbone network to extract multi-scale features from multi-view images across multiple frames. We then apply voxel queries to learn the voxel features via the View Encoder. The Temporal Encoder aligns the previous voxel features to the current frame and fuses the features together. Voxel Upsample restores the high-resolution voxel representation for fine-grained semantic classification. The Task Head predicts object detection and semantic segmentation with two separate heads. The Refine Module further refines the thing-class predictions with the help of 3D object detection and assigns instance IDs to the thing-occupied grids. Finally, we obtain the 3D panoptic segmentation for the current frame.
  • Figure 3: Illustration of occupancy sparsify, an optional technique to boost efficiency. The illustration uses a BEV representation for simplicity, although the process operates in 3D. The light yellow region is pruned according to the occupancy masks.
  • Figure 4: Qualitative results on nuScenes validation set. Our PanoOcc takes multi-view images as input and produces voxel predictions, which are visualized at a resolution of 200x200x32. We evaluate 3D semantic segmentation and panoptic segmentation on LiDAR points.
  • Figure 5: Qualitative results on Occ3D-nuScenes validation set. Our PanoOcc takes multi-view images as input and produces dense occupancy predictions, which are visualized at the resolution of 200x200x16.
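
The pruning step described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (sparsify_voxels, densify_voxels), the 0.5 threshold, and the toy grid size are hypothetical, and the occupancy probabilities are assumed to come from some upstream predictor. The sketch only shows the general idea of keeping voxels that an occupancy mask marks as occupied and scattering them back into a dense grid when needed.

```python
import numpy as np


def sparsify_voxels(features, occupancy_prob, threshold=0.5):
    """Keep only voxels whose occupancy probability exceeds `threshold`.

    features:       (X, Y, Z, C) dense voxel features.
    occupancy_prob: (X, Y, Z) probability that each voxel is occupied
                    (assumed to be produced by an upstream predictor).
    Returns (coords, kept): (N, 3) integer voxel indices and (N, C) features.
    The threshold value is an illustrative assumption.
    """
    if features.shape[:3] != occupancy_prob.shape:
        raise ValueError("features and occupancy_prob must share a grid shape")
    mask = occupancy_prob > threshold   # boolean occupancy mask
    coords = np.argwhere(mask)          # row-major order, same as features[mask]
    kept = features[mask]               # pruned voxels are dropped here
    return coords, kept


def densify_voxels(coords, kept, grid_shape, fill_value=0.0):
    """Scatter sparse voxel features back into a dense (X, Y, Z, C) grid."""
    dense = np.full((*grid_shape, kept.shape[1]), fill_value, dtype=kept.dtype)
    dense[coords[:, 0], coords[:, 1], coords[:, 2]] = kept
    return dense


if __name__ == "__main__":
    # Toy grid: 4 x 4 x 2 voxels with 3 feature channels each.
    grid_shape = (4, 4, 2)
    features = np.arange(4 * 4 * 2 * 3, dtype=np.float32).reshape(*grid_shape, 3)
    occupancy_prob = np.zeros(grid_shape, dtype=np.float32)
    occupancy_prob[1, 2, 0] = 0.9   # confidently occupied
    occupancy_prob[3, 0, 1] = 0.8   # confidently occupied
    occupancy_prob[0, 0, 0] = 0.3   # below threshold, will be pruned

    coords, kept = sparsify_voxels(features, occupancy_prob)
    print("kept voxels:", len(coords), "of", occupancy_prob.size)
    dense = densify_voxels(coords, kept, grid_shape)
    print("dense shape after scatter:", dense.shape)
```

Storing only the (coords, kept) pair means memory scales with the number of occupied voxels rather than with the full grid, which is the efficiency argument the caption makes. How and where the paper applies the pruning inside its coarse-to-fine pipeline is not specified in this excerpt.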