Scene as Occupancy

Chonghao Sima; Wenwen Tong; Tai Wang; Li Chen; Silei Wu; Hanming Deng; Yi Gu; Lewei Lu; Ping Luo; Dahua Lin; Hongyang Li

TL;DR

The paper argues that dense 3D occupancy with semantic labeling offers a geometry-aware scene descriptor that surpasses traditional 3D boxes for driving tasks. It introduces OccNet, a vision-centric pipeline with a cascade voxel decoder and temporal cues to reconstruct 3D occupancy, and OpenOcc, a dense occupancy benchmark built on nuScenes. Empirical results show occupancy-based representations improve semantic scene completion, support accurate 3D detection via occupancy pretraining, and substantially reduce planning collision rates (15%–58%), highlighting the practical value for vision-centric autonomous driving. Overall, the work positions 3D occupancy as a versatile foundation for perception and planning, backed by a dense, publicly available benchmark to drive further research.

Abstract

Human drivers can easily describe a complex traffic scene through their visual system. Such precise perception is essential to a driver's planning. To achieve this, a geometry-aware representation that quantizes the physical 3D scene into a structured grid map with semantic labels per cell, termed 3D Occupancy, would be desirable. Compared to bounding boxes, a key insight behind occupancy is that it can capture the fine-grained details of critical obstacles in the scene and thereby facilitate subsequent tasks. Prior or concurrent literature mainly concentrates on a single scene completion task, whereas we argue that this occupancy representation has the potential for broader impact. In this paper, we propose OccNet, a multi-view vision-centric pipeline with a cascade and temporal voxel decoder to reconstruct 3D occupancy. At the core of OccNet is a general occupancy embedding that represents the 3D physical world. Such a descriptor can be applied to a wide span of driving tasks, including detection, segmentation and planning. To validate the effectiveness of this new representation and our proposed algorithm, we propose OpenOcc, the first dense high-quality 3D occupancy benchmark built on top of nuScenes. Empirical experiments show evident performance gains across multiple tasks, e.g., motion planning sees a collision rate reduction of 15%-58%, demonstrating the superiority of our method.
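To make the idea of "quantizing the physical 3D scene into a structured grid map with semantic labels per cell" concrete, here is a minimal sketch in plain Python. It is not the paper's annotation pipeline or OccNet itself: the function name `voxelize`, the `FREE` label id, the majority-vote rule, and the toy grid extents are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

FREE = 0  # hypothetical label id for unoccupied cells


def voxelize(points, labels, origin, voxel_size, grid_shape):
    """Quantize labeled 3D points into a dense semantic occupancy grid.

    Returns a nested list grid[x][y][z] of label ids. Each occupied cell
    takes the majority label of the points that fall inside it (an
    illustrative rule); cells with no points hold FREE.
    """
    nx, ny, nz = grid_shape
    votes = defaultdict(Counter)
    for (px, py, pz), label in zip(points, labels):
        # Floor division maps a metric coordinate to an integer cell index.
        ix = int((px - origin[0]) // voxel_size)
        iy = int((py - origin[1]) // voxel_size)
        iz = int((pz - origin[2]) // voxel_size)
        # Points outside the grid extents are dropped.
        if 0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz:
            votes[(ix, iy, iz)][label] += 1

    grid = [[[FREE] * nz for _ in range(ny)] for _ in range(nx)]
    for (ix, iy, iz), counter in votes.items():
        grid[ix][iy][iz] = counter.most_common(1)[0][0]
    return grid


# Toy scene: a 4 x 4 x 2 grid of 1 m voxels with its corner at (-2, -2, 0).
points = [
    (0.2, 0.3, 0.5),
    (0.7, 0.6, 0.1),
    (0.5, 0.5, 0.9),
    (-1.5, 1.2, 1.4),
    (10.0, 0.0, 0.0),  # outside the grid, so it is ignored
]
labels = [1, 1, 2, 3, 1]
grid = voxelize(points, labels, origin=(-2.0, -2.0, 0.0),
                voxel_size=1.0, grid_shape=(4, 4, 2))
occupied = sum(cell != FREE for plane in grid for column in plane for cell in column)
print(grid[2][2][0], grid[0][3][1], occupied)
```

Unlike a bounding box, which summarizes an object with a single cuboid, each cell here carries its own label, so irregular shapes are described at the resolution of the grid.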

Paper Structure

This paper contains 31 sections, 2 equations, 12 figures, and 15 tables.

Introduction
Related Work
Methodology
Cascade Voxel Decoder
Exploiting Occupancy on Various Tasks
OpenOcc: 3D Occupancy Benchmark
Benchmark Overview
Generating High-quality Annotation
Experiments
Main Results
Discussion
Conclusion
Evaluation Metrics
More Related Work
Implementation Detail of OccNet
...and 16 more sections

Figures (12)

Figure 1: Scene as Occupancy. Representing objects as ViDAR (a) or 3D occupancy (b) has been endorsed by industry mobile2020ces, tesla_ai_day, because a conventional 3D bounding box cannot describe in detail the irregular vehicles found in daily driving scenes, e.g., the protruding tail in (a) or (c). Defining the 3D world as Occupancy in (d) better represents obstacles and helps avoid collisions. In this paper, we envision Occupancy as a general Scene Descriptor, as in (e), for a wide span of driving tasks beyond detection, such as planning, and observe performance gains over previous alternatives.
Figure 2: OccNet pipeline. The core of OccNet is to obtain a representative Occupancy Descriptor and apply it to various driving tasks. Our proposed algorithm consists of two stages. I. Reconstruction of Occupancy. Given multiple visual inputs, we first generate features from the BEV encoder. The Voxel Decoder operates in a cascade fashion in which voxels are refined progressively. A 3D deformable attention (att.) unit serves the same function as its 2D counterpart. Temporal voxels $V_{t-1}$ are also incorporated. Some connections are omitted for brevity. See the main text for details. II. Exploitation of Occupancy. Equipped with the occupancy descriptor, we can proceed to tasks including semantic scene completion and 3D object detection. Compacting them in BEV space yields a BEV segmentation map, which can be fed directly into the planning pipeline st-p32022. Such a design yields a desirable improvement in the planning task.
Figure 3: Visual comparison on 3D occupancy annotations. Compared to (a) sparse occupancy huang2023tri and (b) OccData Fang2023, we generate (c) dense and high-quality annotations with (d) the additional flow annotation of foreground objects, which can be applied for motion planning.
Figure 4: Qualitative results of occupancy prediction. Our method outperforms TPVFormer huang2023tri in terms of scene details and the semantic classification accuracy of foreground objects, such as the pedestrian in the dashed region.
Figure 5: Comparison of detector performance using different pretrained models and different scales of training data. OccNet (sparse) and OccNet (dense) denote OccNet trained on sparse and dense occupancy data, respectively. Best viewed in color.
...and 7 more figures
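
The Figure 2 caption mentions compacting the occupancy output in BEV space to obtain a BEV segmentation map. The caption does not specify the compaction rule, so the sketch below uses one simple assumption: each BEV cell takes the label of the topmost occupied voxel in its vertical column. The function name `compact_to_bev`, the `FREE` label id, and the toy grid are hypothetical and work on discrete labels purely for illustration.

```python
FREE = 0  # hypothetical label id for unoccupied cells


def compact_to_bev(grid):
    """Collapse a semantic voxel grid grid[x][y][z] along the height axis.

    Each BEV cell takes the label of the topmost non-free voxel in its
    column, or FREE when the whole column is empty. This rule is an
    assumption made for illustration only.
    """
    bev = []
    for plane in grid:
        row = []
        for column in plane:
            label = FREE
            # Scan the column from the highest voxel down to the ground.
            for cell in reversed(column):
                if cell != FREE:
                    label = cell
                    break
            row.append(label)
        bev.append(row)
    return bev


# Toy 2 x 2 x 3 grid: labels 1, 2, 3 stand for arbitrary semantic classes.
toy_grid = [
    [[1, 0, 0], [0, 0, 0]],
    [[1, 2, 0], [0, 0, 3]],
]
bev_map = compact_to_bev(toy_grid)
print(bev_map)
```

A 2D map of this kind is the format a BEV-based planner consumes, which is why the caption describes the compacted output as something that can be passed to the planning pipeline.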
