Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving
Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, Hang Zhao
TL;DR
This work introduces Occ3D, a large-scale benchmark for 3D occupancy prediction built on the Waymo Open Dataset and nuScenes. Its automatic label-generation pipeline densifies voxels, reasons about occlusion, and refines voxels with image guidance to produce dense, visibility-aware annotations. It also presents Coarse-to-Fine Occupancy (CTF-Occ), a transformer-based network that fuses multi-view image features into 3D voxel space via cross-attention in a coarse-to-fine manner and uses an incremental token selection strategy to reduce computation. On both Occ3D-nuScenes and Occ3D-Waymo, CTF-Occ achieves state-of-the-art IoU and mIoU compared with several baselines, and ablations validate the benefits of the pipeline steps and the token selection strategy. The dataset and code are released to spur research in dense 3D scene understanding, including the handling of general objects that fall outside a fixed ontology. Overall, Occ3D advances 3D occupancy prediction toward robust surround-view perception for autonomous driving.
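To make the geometric IoU and semantic mIoU metrics concrete, here is a minimal sketch that scores flattened voxel grids and counts only voxels flagged as visible, in the spirit of visibility-aware labels. It is not the official Occ3D evaluation code: the function name `occupancy_metrics`, the choice of label `0` as the free class (`FREE`), the exclusion of the free class from mIoU, and skipping classes absent from both prediction and ground truth are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

FREE = 0  # hypothetical label id reserved for empty (free) voxels


def occupancy_metrics(
    pred: Sequence[int],
    gt: Sequence[int],
    visible: Sequence[bool],
    num_classes: int,
) -> Dict[str, float]:
    """Geometric IoU and semantic mIoU over flattened voxel grids.

    Only voxels flagged as visible in the ground-truth mask are scored.
    """
    tp: List[int] = [0] * num_classes
    fp: List[int] = [0] * num_classes
    fn: List[int] = [0] * num_classes
    occ_tp = occ_fp = occ_fn = 0

    for p, g, v in zip(pred, gt, visible):
        if not v:
            continue  # unobserved voxels carry no reliable label
        # Geometry: treat every non-free label as "occupied".
        p_occ, g_occ = p != FREE, g != FREE
        if p_occ and g_occ:
            occ_tp += 1
        elif p_occ:
            occ_fp += 1
        elif g_occ:
            occ_fn += 1
        # Semantics: per-class confusion counts.
        if p == g:
            tp[p] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    occ_denom = occ_tp + occ_fp + occ_fn
    iou = occ_tp / occ_denom if occ_denom else 0.0

    # mIoU over semantic classes only (free excluded); classes absent from
    # both prediction and ground truth are skipped rather than scored as 0.
    class_ious = []
    for c in range(num_classes):
        if c == FREE:
            continue
        denom = tp[c] + fp[c] + fn[c]
        if denom:
            class_ious.append(tp[c] / denom)
    miou = sum(class_ious) / len(class_ious) if class_ious else 0.0
    return {"iou": iou, "miou": miou}


# Usage: six voxels, the last one marked invisible and therefore ignored.
m = occupancy_metrics(
    pred=[0, 1, 1, 2, 0, 2],
    gt=[0, 1, 2, 2, 1, 0],
    visible=[True] * 5 + [False],
    num_classes=3,
)
# m["iou"] == 0.75; m["miou"] is about 0.4167 (= 5/12)
```

Separating the two metrics makes the distinction explicit: IoU only asks whether a voxel is occupied, while mIoU also requires the predicted class to match, so a voxel predicted as the wrong object class still counts toward IoU.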
Abstract
Robotic perception requires modeling both 3D geometry and semantics. Existing methods typically focus on estimating 3D bounding boxes, neglecting finer geometric details and struggling to handle general, out-of-vocabulary objects. 3D occupancy prediction, which estimates the detailed occupancy states and semantics of a scene, is an emerging task to overcome these limitations. To support 3D occupancy prediction, we develop a label generation pipeline that produces dense, visibility-aware labels for any given scene. This pipeline comprises three stages: voxel densification, occlusion reasoning, and image-guided voxel refinement. We establish two benchmarks derived from the Waymo Open Dataset and the nuScenes dataset: Occ3D-Waymo and Occ3D-nuScenes. Furthermore, we provide an extensive analysis of the proposed dataset with various baseline models. Lastly, we propose a new model, dubbed Coarse-to-Fine Occupancy (CTF-Occ) network, which demonstrates superior performance on the Occ3D benchmarks. The code, data, and benchmarks are released at https://tsinghua-mars-lab.github.io/Occ3D/.
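The occlusion-reasoning stage can be illustrated with generic LiDAR ray casting: a ray from the sensor to each return marks the voxel holding the return as observed-occupied, the voxels it crosses before that as observed-free, and everything no ray touches as unobserved. The sketch below is a simplified illustration under stated assumptions, not the authors' pipeline: `lidar_visibility`, `to_voxel`, and `step_ratio` are hypothetical names, the grid is assumed to be axis-aligned and anchored at the world origin, traversal is approximated by dense sampling along each ray rather than an exact voxel walk such as the 3D DDA of Amanatides and Woo, and camera visibility, densification, and image-guided refinement are not modeled.

```python
import math
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def to_voxel(p: Point, voxel_size: float) -> Voxel:
    """Map a metric point to integer voxel indices (grid anchored at the origin)."""
    return (
        math.floor(p[0] / voxel_size),
        math.floor(p[1] / voxel_size),
        math.floor(p[2] / voxel_size),
    )


def lidar_visibility(
    origin: Point,
    points: Iterable[Point],
    voxel_size: float,
    step_ratio: float = 0.25,
) -> Tuple[Set[Voxel], Set[Voxel]]:
    """Split voxels into observed-occupied and observed-free by ray casting.

    Each ray runs from the sensor origin to a LiDAR return. The voxel holding
    the return is occupied; voxels the ray passes through before it are free.
    Voxels touched by no ray are unobserved (e.g. occluded).
    """
    occupied: Set[Voxel] = set()
    free: Set[Voxel] = set()
    step = voxel_size * step_ratio  # sample spacing finer than one voxel
    for pt in points:
        hit = to_voxel(pt, voxel_size)
        occupied.add(hit)
        d = (pt[0] - origin[0], pt[1] - origin[1], pt[2] - origin[2])
        length = math.sqrt(d[0] ** 2 + d[1] ** 2 + d[2] ** 2)
        n = max(1, int(length / step))
        for k in range(n):  # samples at t in [0, 1), excluding the endpoint
            t = k / n
            sample = (origin[0] + t * d[0], origin[1] + t * d[1], origin[2] + t * d[2])
            v = to_voxel(sample, voxel_size)
            if v != hit:
                free.add(v)
    # A voxel that holds any return stays occupied even if another ray crossed it.
    free -= occupied
    return occupied, free


# Usage: one return 3 m ahead of the sensor along x, with 1 m voxels.
occ, free = lidar_visibility((0.5, 0.5, 0.5), [(3.5, 0.5, 0.5)], voxel_size=1.0)
# occ == {(3, 0, 0)}; free == {(0, 0, 0), (1, 0, 0), (2, 0, 0)};
# voxel (4, 0, 0) lies behind the return and stays unobserved.
```

Keeping "unobserved" distinct from "free" is what makes labels visibility-aware: a voxel behind an obstacle is not asserted to be empty, so it can be excluded from training losses or evaluation instead of being scored as a false positive.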
