InverseMatrixVT3D: An Efficient Projection Matrix-Based Approach for 3D Occupancy Prediction

Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall

TL;DR

InverseMatrixVT3D introduces a projection-matrix-based pipeline that converts multi-view image features into dense 3D occupancy volumes without depth supervision or transformer queries. It constructs two static projection matrices, stored in compressed sparse row (CSR) format to reduce GPU memory usage, and uses them to produce local 3D feature volumes and global BEV features, which a global-local attention module then fuses; multi-scale supervision further improves performance. On nuScenes and SemanticKITTI, the approach achieves results competitive with the state of the art, with strong vulnerable road user (VRU) detection and a compact model size, while offering favorable memory and speed characteristics. The method provides a simple, efficient alternative for vision-driven 3D occupancy prediction and autonomous driving safety tasks.

Abstract

This paper introduces InverseMatrixVT3D, an efficient method for transforming multi-view image features into 3D feature volumes for 3D semantic occupancy prediction. Existing methods for constructing 3D volumes often rely on depth estimation, device-specific operators, or transformer queries, which hinders the widespread adoption of 3D occupancy models. In contrast, our approach leverages two projection matrices to store the static mapping relationships and matrix multiplications to efficiently generate global Bird's Eye View (BEV) features and local 3D feature volumes. Specifically, we achieve this by performing matrix multiplications between multi-view image feature maps and two sparse projection matrices. We introduce a sparse matrix handling technique for the projection matrices to optimize GPU memory usage. Moreover, a global-local attention fusion module is proposed to integrate the global BEV features with the local 3D feature volumes to obtain the final 3D volume. We also employ a multi-scale supervision mechanism to enhance performance further. Extensive experiments performed on the nuScenes and SemanticKITTI datasets reveal that our approach not only stands out for its simplicity and effectiveness but also achieves the top performance in detecting vulnerable road users (VRU), crucial for autonomous driving and road safety. The code has been made available at: https://github.com/DanielMing123/InverseMatrixVT3D
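
As a rough illustration of the matrix-multiplication idea described in the abstract, the sketch below stores a tiny voxel-to-pixel projection matrix in CSR form and multiplies it with flattened image features. The sizes, the weights, and the helper names `dense_to_csr` and `csr_matmul` are hypothetical and are not taken from the authors' code; a real implementation would presumably rely on a GPU sparse-matrix library rather than pure Python.

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # non-zero weight
                indices.append(col)   # column (pixel) it belongs to
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr


def csr_matmul(csr, features):
    """Multiply a CSR matrix (voxels x pixels) by dense features (pixels x channels)."""
    data, indices, indptr = csr
    channels = len(features[0])
    out = []
    for row in range(len(indptr) - 1):
        acc = [0.0] * channels
        # Only the stored non-zeros of this row are visited.
        for k in range(indptr[row], indptr[row + 1]):
            pixel = indices[k]
            for c in range(channels):
                acc[c] += data[k] * features[pixel][c]
        out.append(acc)
    return out


# Hypothetical toy setup: 3 voxels, 4 flattened pixels, 2 feature channels.
projection = [
    [0.5, 0.5, 0.0, 0.0],  # voxel 0 averages pixels 0 and 1
    [0.0, 0.0, 1.0, 0.0],  # voxel 1 samples pixel 2
    [0.0, 0.0, 0.0, 0.0],  # voxel 2 is not visible in any view
]
image_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

projection_csr = dense_to_csr(projection)
voxel_features = csr_matmul(projection_csr, image_features)
```

Because the mapping is static, the CSR arrays only need to be built once; each forward pass then reduces to a sparse-dense product, and the all-zero row for the unseen voxel costs no storage beyond one `indptr` entry.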

Paper Structure

This paper contains 21 sections, 7 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Pipeline of three approaches: Depth-Estimation-based approach (upper), Query-based approach (middle), and our approach (lower). We simplify the generation of the 3D Feature Volume by adopting a matrix multiplication method.
  • Figure 2: Overall architecture of InverseMatrixVT3D. First, the multi-camera images are fed into the 2D backbone network to extract features at multiple scales. Next, a multi-scale global-local projection module constructs multi-scale 3D feature volumes and BEV planes. A global-local attention fusion module is then applied to each 3D feature volume and BEV plane at every level to obtain the final 3D feature volume. Finally, the 3D volume at each level is upsampled using 3D deconvolution for skip connections, and a supervision signal is applied at each level.
  • Figure 3: The predefined 3D volume feature sampling process. Each voxel grid is initially divided into $N^{3}$ subspaces along the vertical and horizontal directions, and the center of each subspace serves as a sample point. These sample points are then projected onto all the multi-view feature maps to aggregate the corresponding features for the voxel grid.
  • Figure 4: (a) The local projection matrix VT_XYZ represents the static mapping relationship for the feature sampling process of the local 3D feature volume. (b) The global projection matrix VT_XY represents the static mapping relationship for the feature sampling process of the global Bird's Eye View (BEV) feature.
  • Figure 5: The global-local fusion module. The global BEV feature and the local 3D feature volume are refined using traditional convolutional layers. The global BEV feature undergoes additional enhancement through an efficient window attention module and a bottleneck ASPP module. It then merges with the local 3D feature volume, resulting in the final 3D volume.
  • ...and 1 more figure
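
The sampling process in the Figure 3 and Figure 4 captions can be sketched as follows: a voxel is split into N^3 sub-cells, each sub-cell center is projected into an image, and the hits are accumulated into one sparse row of a projection matrix. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a single ideal pinhole camera looking along +z with points already in the camera frame, equal averaging weights per sample point, and the hypothetical helper names `voxel_sample_points`, `project_to_pixel`, and `projection_row`.

```python
import math


def voxel_sample_points(origin, size, n):
    """Centers of the n**3 equal sub-cells of an axis-aligned cubic voxel."""
    step = size / n
    offsets = [(i + 0.5) * step for i in range(n)]
    ox, oy, oz = origin
    return [(ox + dx, oy + dy, oz + dz)
            for dx in offsets for dy in offsets for dz in offsets]


def project_to_pixel(point, focal, cx, cy, width, height):
    """Pinhole projection; returns a flattened pixel index, or None if not visible."""
    x, y, z = point
    if z <= 0:  # point is behind the camera
        return None
    u = math.floor(focal * x / z + cx)
    v = math.floor(focal * y / z + cy)
    if 0 <= u < width and 0 <= v < height:
        return v * width + u
    return None


def projection_row(origin, size, n, focal, cx, cy, width, height):
    """Sparse row {pixel_index: weight} of the projection matrix for one voxel."""
    points = voxel_sample_points(origin, size, n)
    row = {}
    for point in points:
        idx = project_to_pixel(point, focal, cx, cy, width, height)
        if idx is not None:
            # Assumption: every sample point contributes an equal share.
            row[idx] = row.get(idx, 0.0) + 1.0 / len(points)
    return row


# Hypothetical example: a 1 m voxel in front of a 4x4 feature map, N = 2.
points = voxel_sample_points((-0.5, -0.5, 2.0), 1.0, 2)
row = projection_row((-0.5, -0.5, 2.0), 1.0, 2,
                     focal=4.0, cx=2.0, cy=2.0, width=4, height=4)
hidden = projection_row((-0.5, -0.5, -3.0), 1.0, 2,
                        focal=4.0, cx=2.0, cy=2.0, width=4, height=4)
```

Stacking one such row per voxel gives a matrix of shape (voxels x pixels) that depends only on the camera geometry and grid layout, which is why it can be computed once and reused. Voxels that no camera sees, such as `hidden` here, yield empty rows.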