Fully Sparse 3D Occupancy Prediction

Haisong Liu; Yang Chen; Haiguang Wang; Zetong Yang; Tianyu Li; Jia Zeng; Li Chen; Hongyang Li; Limin Wang

TL;DR

SparseOcc introduces a fully sparse pipeline for camera-based 3D occupancy prediction, combining a sparse voxel decoder that models only non-free voxels with a mask transformer that predicts semantic and instance occupancy via sparse queries. It avoids dense 3D features, sparse-to-dense modules, and global attention, and pairs this with RayIoU, a ray-based evaluation metric that addresses the inconsistent depth penalty of voxel-level mIoU. On Occ3D-nuScenes, SparseOcc achieves a RayIoU of 34.0 at a real-time 17.3 FPS with 7 history frames, and reaches 35.1 RayIoU when the number of history frames is increased to 15. The work also covers ablations, temporal fusion, and a panoptic extension, and discusses the practicality and limitations of fully sparse occupancy prediction from camera-only inputs.

Abstract

Occupancy prediction plays a pivotal role in autonomous driving. Previous methods typically construct dense 3D volumes, neglecting the inherent sparsity of the scene and suffering from high computational costs. To bridge the gap, we introduce a novel fully sparse occupancy network, termed SparseOcc. SparseOcc first reconstructs a sparse 3D representation from camera-only inputs and then predicts semantic/instance occupancy from this sparse representation using sparse queries. A mask-guided sparse sampling is designed to enable sparse queries to interact with 2D features in a fully sparse manner, thereby circumventing costly dense features or global attention. Additionally, we design a ray-based evaluation metric, namely RayIoU, to resolve the inconsistent penalty along the depth axis that arises in the traditional voxel-level mIoU criterion. SparseOcc demonstrates its effectiveness by achieving a RayIoU of 34.0 while maintaining a real-time inference speed of 17.3 FPS with 7 history frames as input. By increasing the number of preceding frames to 15, SparseOcc further improves its performance to 35.1 RayIoU without bells and whistles.

Paper Structure (34 sections, 4 equations, 10 figures, 5 tables)

Introduction
Related Work
Camera-based 3D Occupancy Prediction.
Sparse Architectures for 3D Vision.
End-to-end 3D Reconstruction from Posed Images.
Mask Transformer.
SparseOcc
Sparse Voxel Decoder
Overall architecture.
Detailed design.
Temporal modeling.
Supervision.
Mask Transformer
Mask-guided sparse sampling.
Prediction.
...and 19 more sections

Figures (10)

Figure 1: (a) SparseOcc reconstructs a sparse 3D representation from camera-only inputs by a sparse voxel decoder, and then estimates the mask and label of each segment via a set of sparse queries. (b) Performance comparison on the validation split of Occ3D-nuScenes. FPS is measured on a Tesla A100 with the PyTorch fp32 backend.
Figure 2: SparseOcc is a fully sparse architecture: it neither relies on dense 3D features nor uses sparse-to-dense or global attention operations. The sparse voxel decoder reconstructs the sparse geometry of the scene, consisting of $K$ voxels ($K \ll W \times H \times D$). The mask transformer then uses $N$ sparse queries to predict the mask and label of each segment. SparseOcc can be easily extended to panoptic occupancy by replacing the semantic queries with instance queries.
Figure 3: The sparse voxel decoder employs a coarse-to-fine pipeline with three layers. Within each layer, we utilize a transformer-like architecture for 3D-2D interaction. At the end of every layer, the voxel resolution is upsampled by a factor of $2\times$, and probabilities of voxel occupancy are estimated.
Figure 4: Visualization of the discrepancy between qualitative and quantitative results. We observe that training existing dense occupancy methods (e.g., BEVFormer) with a visible mask results in a thick surface, leading to an unreasonably inflated improvement in the current mIoU metric. In contrast, our new RayIoU metric provides a more accurate reflection of model performance.
Figure 5: Illustration of inconsistent depth penalties caused by current metrics. Consider a scenario where we have a wall in front of us, with a ground-truth distance of $d$ and a thickness of $d_v$. When the prediction has a thickness of $d_p \gg d_v$, we encounter an inconsistent penalty along depth. Specifically, if the predicted wall is $d_v$ farther than the ground truth (total distance $d + d_v$), its IoU will be zero. Conversely, if the predicted wall is $d_v$ closer than the ground truth (total distance $d - d_v$), the IoU remains at 0.5. This occurs because all voxels behind the surface are filled with duplicated predictions. Similarly, when the predicted depth is $d - 2d_v$, the resulting IoU is $\frac{1}{3}$, and so forth.
...and 5 more figures
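
The numbers in the Figure 5 caption (IoU of 0, 0.5, and 1/3) can be reproduced with a small one-dimensional sketch along a single ray. This is an illustration rather than the paper's evaluation code. It assumes depth is discretized into integer voxel indices, and that voxels behind the back face of the ground-truth wall are excluded from evaluation (our reading of the "visible mask" mentioned in Figure 4). The function name `masked_iou_along_ray` and its parameters are hypothetical.

```python
def masked_iou_along_ray(gt_start, gt_thickness, pred_start, pred_thickness):
    """Voxel IoU of a predicted wall vs. a ground-truth wall along one ray.

    All arguments are integer voxel indices or counts along the depth axis.
    Assumption: voxels behind the ground-truth wall are unobserved and are
    ignored by the metric, so a thick prediction is not penalized there.
    """
    gt = set(range(gt_start, gt_start + gt_thickness))
    pred = set(range(pred_start, pred_start + pred_thickness))

    # Evaluate only voxels up to and including the ground-truth wall.
    visible = set(range(0, gt_start + gt_thickness))
    pred &= visible

    union = gt | pred
    if not union:
        return 0.0
    return len(gt & pred) / len(union)


# Ground-truth wall at depth d = 10 with thickness d_v = 1;
# the prediction is a thick wall with d_p = 20 >> d_v.
d, d_v, d_p = 10, 1, 20
for offset in (+1, 0, -1, -2):
    iou = masked_iou_along_ray(d, d_v, d + offset * d_v, d_p)
    print(f"predicted depth d{offset * d_v:+d}: IoU = {iou:.3f}")
```

Under these assumptions, a thick prediction that starts $k \cdot d_v$ in front of the true surface scores $\frac{1}{k+1}$, while one that starts $d_v$ behind it scores zero, which is the asymmetry the caption describes.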
