Fully Sparse 3D Occupancy Prediction
Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu Li, Jia Zeng, Li Chen, Hongyang Li, Limin Wang
TL;DR
SparseOcc introduces a fully sparse pipeline for camera-based 3D occupancy prediction. It combines a sparse voxel decoder, which models only non-free voxels, with a mask transformer that predicts semantic and instance occupancy via sparse queries. The design eliminates dense 3D features and sparse-to-dense modules. The paper also proposes RayIoU, a ray-based evaluation metric that addresses the inconsistent depth-axis penalty of voxel-level mIoU. On Occ3D-nuScenes, SparseOcc reaches 34.0 RayIoU at a real-time 17.3 FPS with 7 history frames, and 35.1 RayIoU with 15 history frames. The work also covers ablations, temporal fusion, and a panoptic extension, and discusses the practicality and limitations of fully sparse occupancy prediction from camera-only inputs. The results point toward scalable, real-time occupancy perception for autonomous systems, including panoptic occupancy tasks.
Abstract
Occupancy prediction plays a pivotal role in autonomous driving. Previous methods typically construct dense 3D volumes, neglecting the inherent sparsity of the scene and incurring high computational costs. To bridge this gap, we introduce a novel fully sparse occupancy network, termed SparseOcc. SparseOcc first reconstructs a sparse 3D representation from camera-only inputs and then predicts semantic/instance occupancy from this sparse representation using sparse queries. A mask-guided sparse sampling scheme lets the sparse queries interact with 2D features in a fully sparse manner, thereby avoiding costly dense features or global attention. Additionally, we design a ray-based evaluation metric, RayIoU, to resolve the inconsistent penalty along the depth axis that arises in the traditional voxel-level mIoU criterion. SparseOcc demonstrates its effectiveness by achieving a RayIoU of 34.0 while maintaining a real-time inference speed of 17.3 FPS, with 7 history frames as input. Increasing the number of preceding frames to 15 further improves performance to 35.1 RayIoU, without bells and whistles.
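To make the idea of a ray-based metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes each query ray has already been reduced to a pair (depth of the first occupied voxel hit, class label) for both the prediction and the ground truth, and that a ray counts as a true positive when the classes match and the absolute depth error is below a tolerance. The function names `ray_iou` and `mean_ray_iou` and the thresholds of 1, 2 and 4 metres are illustrative assumptions that the abstract does not specify.

```python
def ray_iou(pred, gt, num_classes, threshold):
    """Mean per-class IoU over rays (illustrative sketch).

    pred, gt: sequences of (depth, class_id), one entry per query ray,
    in the same ray order. A ray is a true positive for its class when
    the class labels agree and |depth_pred - depth_gt| < threshold.
    """
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for (d_pred, c_pred), (d_gt, c_gt) in zip(pred, gt):
        if c_pred == c_gt and abs(d_pred - d_gt) < threshold:
            tp[c_gt] += 1
        else:
            # Either the class is wrong or the surface is too far off in depth:
            # the predicted class gains a false positive and the ground-truth
            # class gains a false negative.
            fp[c_pred] += 1
            fn[c_gt] += 1
    ious = []
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        if denom > 0:  # skip classes that appear in neither pred nor gt
            ious.append(tp[c] / denom)
    return sum(ious) / len(ious) if ious else 0.0


def mean_ray_iou(pred, gt, num_classes, thresholds=(1.0, 2.0, 4.0)):
    """Average ray_iou over several depth tolerances (assumed values, in metres)."""
    return sum(ray_iou(pred, gt, num_classes, t) for t in thresholds) / len(thresholds)


# Toy example: three rays, two classes. The third ray has the right class
# but a 3 m depth error, so it only counts as correct at the loosest tolerance.
pred_rays = [(10.0, 0), (5.5, 1), (20.0, 1)]
gt_rays = [(10.4, 0), (5.0, 1), (17.0, 1)]
print(mean_ray_iou(pred_rays, gt_rays, num_classes=2))
```

Because each ray is scored once at the first surface it hits, a prediction is judged on where the surface lies along the ray and which class it has, rather than voxel by voxel along the depth axis.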
