Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving

Xiaoyu Tian; Tao Jiang; Longfei Yun; Yucheng Mao; Huitong Yang; Yue Wang; Yilun Wang; Hang Zhao

TL;DR

This work introduces Occ3D, a large-scale benchmark for 3D occupancy prediction built on Waymo and nuScenes, with an automatic label-generation pipeline that densifies voxels and reasons about occlusion to produce visibility-aware annotations. It also presents Coarse-to-Fine Occupancy (CTF-Occ), a transformer-based network that fuses multi-view image features into 3D voxel space via cross-attention in a coarse-to-fine manner and uses an incremental token selection strategy for efficiency. On both Occ3D-nuScenes and Occ3D-Waymo, CTF-Occ achieves state-of-the-art IoU/mIoU against several baselines, and ablations validate the benefits of the pipeline steps and the token selection strategy. The dataset and code are released to spur research in dense 3D scene understanding, including the handling of General Objects beyond a fixed ontology. Overall, Occ3D advances 3D occupancy prediction toward robust, surround-view perception for autonomous driving.
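
As a rough illustration of how an mIoU score can interact with visibility-aware labels, the sketch below computes per-class IoU only over voxels flagged as visible and averages over the classes that occur. The summary above does not spell out the benchmark's exact evaluation protocol, so restricting the score to visible voxels, skipping absent classes, and the names `masked_miou`, `pred`, `gt`, and `visible` are all assumptions made for this example.

```python
import numpy as np


def masked_miou(pred, gt, visible, num_classes):
    """Mean IoU over classes, restricted to voxels marked visible.

    Assumption for this sketch: unobserved voxels are simply excluded,
    and classes absent from both prediction and ground truth are skipped.
    """
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    mask = np.asarray(visible).astype(bool)

    # Keep only the voxels that the visibility mask marks as observed.
    p, g = pred[mask], gt[mask]

    ious = []
    for c in range(num_classes):
        intersection = np.sum((p == c) & (g == c))
        union = np.sum((p == c) | (g == c))
        if union > 0:
            ious.append(intersection / union)

    return float(np.mean(ious)) if ious else float("nan")
```

The inputs can have any shape (for example a full voxel grid) as long as `pred`, `gt`, and `visible` match, because the boolean mask flattens the selection.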

Abstract

Robotic perception requires the modeling of both 3D geometry and semantics. Existing methods typically focus on estimating 3D bounding boxes, neglecting finer geometric details and struggling to handle general, out-of-vocabulary objects. 3D occupancy prediction, which estimates the detailed occupancy states and semantics of a scene, is an emerging task to overcome these limitations. To support 3D occupancy prediction, we develop a label generation pipeline that produces dense, visibility-aware labels for any given scene. This pipeline comprises three stages: voxel densification, occlusion reasoning, and image-guided voxel refinement. We establish two benchmarks, derived from the Waymo Open Dataset and the nuScenes Dataset, namely Occ3D-Waymo and Occ3D-nuScenes benchmarks. Furthermore, we provide an extensive analysis of the proposed dataset with various baseline models. Lastly, we propose a new model, dubbed Coarse-to-Fine Occupancy (CTF-Occ) network, which demonstrates superior performance on the Occ3D benchmarks. The code, data, and benchmarks are released at https://tsinghua-mars-lab.github.io/Occ3D/.

Paper Structure (25 sections, 9 figures, 5 tables, 3 algorithms)

This paper contains 25 sections, 9 figures, 5 tables, 3 algorithms.

Introduction
Related Work
Occ3D Dataset
Task Definition
Dataset Statistics
Dataset Construction Pipeline
Voxel Densification
Occlusion Reasoning for Visibility Mask
Image-guided Voxel Refinement
Quality Check
3D-2D consistency
Quantitative Results
Coarse-to-Fine Occupancy Network
Experiments
Experimental Setup
...and 10 more sections

Figures (9)

Figure 1: Our Occ3D dataset demonstrates rich semantic and geometric expressiveness. (a) Diversity of scenes in the Occ3D dataset; (b) Out-of-vocabulary objects, also known as General Objects (GOs), that cannot be exhaustively enumerated in the real world; (c) Irregularly shaped objects whose geometry 3D bounding boxes fail to represent accurately.
Figure 2: Overview of the label generation pipeline. The pipeline consists of three main steps: voxel densification, occlusion reasoning, and image-guided voxel refinement. Voxel densification consists of object segmentation, multi-frame aggregation, and label assignment.
Figure 3: Visibility and refinement. (a) LiDAR visibility: a voxel is "occupied" if it reflects LiDAR (red voxels), or "free" if it is traversed by a ray (white voxels); camera visibility: any voxel not scanned by camera rays is set to "unobserved" (blue and yellow voxels). (b) Image-guided voxel refinement: during ray casting, when the first voxel with the same semantic label as the pixel label is encountered, we set the previously traversed voxel states to "free" (green voxels).
Figure 4: 3D-2D consistency. (a) 2D ROI within single-frame LiDAR scan range. (b) Semantic labels of a single image within the 2D ROI. (c) The reprojection of 3D voxel semantic labels onto the image within the 2D ROI.
Figure 5: The architecture of CTF-Occ network. CTF-Occ consists of an image backbone, a coarse-to-fine voxel encoder, and an implicit occupancy decoder.
...and 4 more figures
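
The visibility and refinement rules in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the state constants, and the fixed half-voxel sampling along each ray (a stand-in for exact voxel traversal) are all hypothetical choices made for this example.

```python
import numpy as np

# Hypothetical state codes used only in this sketch.
UNOBSERVED, FREE, OCCUPIED = 0, 1, 2


def lidar_visibility(origin, hits, grid_shape, voxel_size, grid_min):
    """Mark voxels hit by a LiDAR return as occupied and voxels crossed
    by a ray as free; everything else stays unobserved."""
    origin = np.asarray(origin, dtype=float)
    hits = np.asarray(hits, dtype=float).reshape(-1, 3)
    grid_min = np.asarray(grid_min, dtype=float)
    shape = np.asarray(grid_shape)
    state = np.full(grid_shape, UNOBSERVED, dtype=np.uint8)

    def to_index(points):
        return np.floor((points - grid_min) / voxel_size).astype(int)

    def inside(idx):
        return np.all((idx >= 0) & (idx < shape), axis=1)

    # Pass 1: voxels traversed by a ray are free.
    for hit in hits:
        length = np.linalg.norm(hit - origin)
        # Half-voxel sampling approximates exact voxel traversal.
        n_steps = max(int(np.ceil(length / (0.5 * voxel_size))), 1)
        t = np.linspace(0.0, 1.0, n_steps, endpoint=False)
        idx = to_index(origin + t[:, None] * (hit - origin))
        idx = idx[inside(idx)]
        state[idx[:, 0], idx[:, 1], idx[:, 2]] = FREE

    # Pass 2: voxels containing a return are occupied (overrides free).
    idx = to_index(hits)
    idx = idx[inside(idx)]
    state[idx[:, 0], idx[:, 1], idx[:, 2]] = OCCUPIED
    return state


def refine_along_ray(ray_labels, pixel_label, free_label):
    """Image-guided refinement along one camera ray: voxels traversed
    before the first voxel matching the pixel's label are set to free."""
    ray_labels = np.array(ray_labels)  # work on a copy
    matches = np.flatnonzero(ray_labels == pixel_label)
    if matches.size:
        ray_labels[: matches[0]] = free_label
    return ray_labels
```

Running the occupied pass after the free pass means a voxel that both reflects one ray and is crossed by another ends up occupied, which is one reasonable reading of the caption; a production pipeline would likely also use an exact traversal algorithm rather than point sampling.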
