OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception

Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, Xingang Wang

TL;DR

OpenOccupancy pioneers surrounding semantic occupancy perception by introducing nuScenes-Occupancy, whose dense annotations are produced with the Augmenting And Purifying (AAP) pipeline and enable high-resolution 3D scene labeling across 360 degrees. It establishes camera, LiDAR, and multimodal baselines and proposes CONet, a coarse-to-fine refinement network, to address the computational burden of high-resolution occupancy predictions. Empirical results demonstrate that surround-view approaches outperform front-view methods, multimodal fusion yields substantial gains, and CONet provides about a 30% relative improvement over the baseline with modest overhead. This benchmark and approach aim to accelerate the development of robust surrounding occupancy perception for autonomous driving.

Abstract

Semantic occupancy perception is essential for autonomous driving, as automated vehicles require a fine-grained perception of 3D urban structures. However, existing benchmarks lack diversity in urban scenes, and they only evaluate front-view predictions. Towards a comprehensive benchmarking of surrounding perception algorithms, we propose OpenOccupancy, the first surrounding semantic occupancy perception benchmark. In the OpenOccupancy benchmark, we extend the large-scale nuScenes dataset with dense semantic occupancy annotations. Previous annotations rely on the superimposition of LiDAR points, where some occupancy labels are missed due to sparse LiDAR channels. To mitigate this problem, we introduce the Augmenting And Purifying (AAP) pipeline to densify the annotations by ~2x, with ~4000 human hours involved in the labeling process. In addition, camera-based, LiDAR-based, and multi-modal baselines are established for the OpenOccupancy benchmark. Furthermore, since the complexity of surrounding occupancy perception lies in the computational burden of high-resolution 3D predictions, we propose the Cascade Occupancy Network (CONet) to refine the coarse prediction, which yields a relative performance improvement of ~30% over the baseline. We hope the OpenOccupancy benchmark will boost the development of surrounding occupancy perception algorithms.

Paper Structure (15 sections, 8 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: nuScenes-Occupancy provides dense semantic occupancy labels for all key frames in the nuScenes dataset. Here we showcase the annotated ground truth with a volumetric size of $(40\times 512\times 512)$ and a grid size of 0.2 m.
  • Figure 2: Comparison between the initial, pseudo, and augmented-and-purified annotations, where regions highlighted by red and blue circles indicate that the augmented annotation is denser and more accurate.
  • Figure 3: Overall architecture of the three proposed baselines. The LiDAR branch uses a 3D encoder to extract voxelized LiDAR features, and the camera branch uses a 2D encoder to learn surround-view features, which are then transformed into 3D camera voxel features. In the multi-modal branch, the adaptive fusion module dynamically integrates features from the two modalities. All three branches use a 3D decoder and an occupancy head to produce semantic occupancy. In the occupancy result figures, regions highlighted by red and purple circles indicate that the multi-modal branch generates more complete and accurate predictions (better viewed when zoomed in).
  • Figure 4: Overall framework of the multi-modal CONet. (1) The coarse occupancy is first generated by the multi-modal baseline. (2) Then the occupied voxels are split to produce high-resolution occupancy queries. (3) Subsequently, we project queries to sample from 2D image features and 3D voxel features. The sampled features are fused and regularized by Fully-Connected (FC) layers to generate fine-grained occupancy predictions.
  • Figure 5: Visualization of the semantic occupancy predictions, where the 1st row shows the surround-view images. In the 2nd and 3rd rows, we show the camera view of the coarse and fine occupancy generated by the multi-modal baseline and the multi-modal CONet, respectively. In the 4th row, we compare their global-view predictions.
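
The grid geometry in the Figure 1 caption and the voxel-splitting step (2) in the Figure 4 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the axis order (height first), and the split factor of 2 per axis are assumptions made here for illustration; only the grid shape $(40\times 512\times 512)$ and the 0.2 m voxel size come from the captions.

```python
import numpy as np

# Values from the Figure 1 caption. Treating the first axis as height is an assumption.
GRID_SHAPE = (40, 512, 512)
VOXEL_SIZE_M = 0.2


def grid_extent_m(shape=GRID_SHAPE, voxel_size=VOXEL_SIZE_M):
    """Physical extent (in metres) covered by the grid along each axis."""
    return tuple(round(n * voxel_size, 6) for n in shape)


def split_occupied_voxels(coarse_occ, factor=2):
    """Split every occupied coarse voxel into factor**3 fine-resolution queries.

    coarse_occ: boolean array of shape (D, H, W), True where a voxel is occupied.
    factor:     hypothetical per-axis split factor (not taken from the paper).
    Returns an integer array of shape (N * factor**3, 3) with fine-grid indices.
    """
    coarse_idx = np.argwhere(coarse_occ)  # (N, 3) indices of occupied voxels
    axes = [np.arange(factor)] * 3
    # All child offsets inside one coarse voxel, shape (factor**3, 3).
    offsets = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    fine_idx = coarse_idx[:, None, :] * factor + offsets[None, :, :]
    return fine_idx.reshape(-1, 3)


def query_centers_m(fine_idx, fine_voxel_size):
    """Metric centre of each query, measured from the grid corner."""
    return (fine_idx + 0.5) * fine_voxel_size


if __name__ == "__main__":
    print(grid_extent_m())
    coarse = np.zeros((2, 2, 2), dtype=bool)
    coarse[1, 0, 1] = True
    queries = split_occupied_voxels(coarse, factor=2)
    print(queries.shape)
```

With the caption's values, `grid_extent_m()` gives 8.0 m along the first axis and 102.4 m along each of the other two. Because only occupied coarse voxels generate queries, the number of fine-resolution locations to evaluate scales with the occupied volume rather than with the full high-resolution grid, which is the motivation the abstract gives for coarse-to-fine refinement. Feature sampling and the FC layers from steps (3) onward are not sketched here.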