InverseMatrixVT3D: An Efficient Projection Matrix-Based Approach for 3D Occupancy Prediction
Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall
TL;DR
InverseMatrixVT3D introduces a projection-matrix-based pipeline that converts multi-view image features into dense 3D occupancy volumes without depth supervision or transformer queries. It constructs two static projection matrices, stored in compressed sparse row (CSR) format to reduce GPU memory use, and multiplies them with the image features to produce local 3D feature volumes and global BEV features. A global-local attention module fuses the two, and multi-scale supervision further improves performance. On nuScenes and SemanticKITTI, the approach achieves results competitive with the state of the art, with strong vulnerable road user (VRU) detection, a compact model size, and favorable memory and speed characteristics. The method offers a simple, efficient alternative for vision-based 3D occupancy prediction in safety-critical autonomous driving.
Abstract
This paper introduces InverseMatrixVT3D, an efficient method for transforming multi-view image features into 3D feature volumes for 3D semantic occupancy prediction. Existing methods for constructing 3D volumes often rely on depth estimation, device-specific operators, or transformer queries, which hinders the widespread adoption of 3D occupancy models. In contrast, our approach leverages two projection matrices to store the static mapping relationships and matrix multiplications to efficiently generate global Bird's Eye View (BEV) features and local 3D feature volumes. Specifically, we achieve this by performing matrix multiplications between multi-view image feature maps and two sparse projection matrices. We introduce a sparse matrix handling technique for the projection matrices to optimize GPU memory usage. Moreover, a global-local attention fusion module is proposed to integrate the global BEV features with the local 3D feature volumes to obtain the final 3D volume. We also employ a multi-scale supervision mechanism to enhance performance further. Extensive experiments performed on the nuScenes and SemanticKITTI datasets reveal that our approach not only stands out for its simplicity and effectiveness but also achieves the top performance in detecting vulnerable road users (VRU), crucial for autonomous driving and road safety. The code has been made available at: https://github.com/DanielMing123/InverseMatrixVT3D
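The core idea of the abstract, a static sparse projection matrix multiplied with flattened image features, can be illustrated with a small sketch. This is not the authors' implementation: the toy sizes, the averaging weights, and the helper names `build_csr` and `csr_matmul` are assumptions made for illustration, and a practical version would presumably rely on a GPU sparse-matrix library rather than pure Python. Each row of the projection matrix corresponds to one voxel, each column to one image-feature pixel, and only the nonzero entries are stored in CSR form.

```python
def build_csr(rows, n_cols):
    """Pack per-row lists of (column, weight) pairs into CSR arrays.

    Returns (data, indices, indptr); row r owns the entries in
    data[indptr[r]:indptr[r + 1]].
    """
    data, indices, indptr = [], [], [0]
    for entries in rows:
        for col, weight in sorted(entries):
            if not 0 <= col < n_cols:
                raise ValueError(f"column {col} out of range")
            data.append(weight)
            indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


def csr_matmul(csr, dense):
    """Multiply a CSR matrix by a dense matrix given as a list of rows."""
    data, indices, indptr = csr
    n_channels = len(dense[0])
    out = []
    for r in range(len(indptr) - 1):
        acc = [0.0] * n_channels
        # Only the stored nonzeros of row r contribute to the output.
        for k in range(indptr[r], indptr[r + 1]):
            feat = dense[indices[k]]
            for c in range(n_channels):
                acc[c] += data[k] * feat[c]
        out.append(acc)
    return out


# Toy input: 4 flattened image-feature pixels, 2 channels each.
image_features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

# Hypothetical static mapping from voxels to the pixels they project onto.
# Weights average the pixels hit by each voxel (an assumption of this sketch).
voxel_to_pixels = [
    [(0, 0.5), (1, 0.5)],  # voxel 0 projects onto pixels 0 and 1
    [(2, 1.0)],            # voxel 1 projects onto pixel 2
    [],                    # voxel 2 is not visible in any camera
]

# The matrix depends only on geometry, so it can be built once and reused.
projection = build_csr(voxel_to_pixels, n_cols=len(image_features))
voxel_features = csr_matmul(projection, image_features)
print(voxel_features)
```

Because the mapping is fixed by the camera geometry, the CSR arrays are computed once and reused for every frame; storage grows with the number of nonzero entries rather than with the full voxel-by-pixel matrix size, which is the memory saving that a sparse format provides.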
