SSLFusion: Scale & Space Aligned Latent Fusion Model for Multimodal 3D Object Detection
Bonan Ding, Jin Xie, Jing Nie, Jiale Cao
TL;DR
SSLFusion addresses scale and space misalignment in multimodal 3D object detection by combining a Scale-Aligned Fusion strategy, a 3D-to-2D Space Alignment module, and a Latent Cross-Modal Fusion module. It fuses 2D and 3D features at every stage through a 3D pyramid fusion design, and embeds 3D coordinates into 2D features to narrow the inter-modal gap. Instead of expensive QKV-based attention, it relies on efficient latent interactions with complexity $O(N \cdot c \cdot n)$. On KITTI and DENSE the method reports state-of-the-art performance, including $2.76\%$ and $2.98\%$ improvements in 3D AP on the KITTI moderate and hard levels and strong gains under adverse weather on DENSE, which indicates both accuracy and robustness. Ablations confirm the value of each component and of the overall architecture, making the approach a practical and efficient option for cross-modal perception in autonomous systems.
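To make the $O(N \cdot c \cdot n)$ figure concrete, the following is a minimal sketch of a generic latent-bottleneck fusion, not the paper's actual LFM. It assumes that $N$ is the number of feature locations, $c$ the channel width, and $n$ a small number of latent vectors; this reading of the symbols, the function names `latent_fusion` and `mult_counts`, and the softmax-based weighting are all illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_fusion(feats_3d, feats_2d, latents):
    """Generic latent-bottleneck fusion (illustrative, NOT the paper's LFM).

    feats_3d: (N, c) features from the point-cloud branch
    feats_2d: (N, c) image features at the matching locations
    latents:  (n, c) hypothetical latent vectors, with n much smaller than N
    """
    c = feats_2d.shape[1]
    # Affinity of every image feature to every latent: (N, n), about N*c*n multiplies.
    affinity = softmax(feats_2d @ latents.T / np.sqrt(c), axis=0)
    # Summarise the image features into n latent slots: (n, c), about N*c*n multiplies.
    summary = affinity.T @ feats_2d
    # Each 3D feature reads from the n slots instead of from all N image features.
    read = softmax(feats_3d @ summary.T / np.sqrt(c), axis=1)
    return feats_3d + read @ summary

def mult_counts(N, c, n):
    # Rough multiply counts for one score matrix: dense N x N versus latent N x n.
    return N * N * c, N * n * c

if __name__ == "__main__":
    dense, latent = mult_counts(N=1000, c=64, n=16)
    print(dense, latent)  # 64000000 1024000
```

Under these assumptions no $N \times N$ matrix is ever formed, so the cost grows linearly in $N$ rather than quadratically.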
Abstract
Multimodal 3D object detection based on deep neural networks has made significant progress. However, it still suffers from the misalignment of scale and spatial information between features extracted from 2D images and those derived from 3D point clouds. Existing methods usually aggregate multimodal features at a single stage, even though multi-stage cross-modal features are crucial for detecting objects of various scales. As a result, these methods often struggle to integrate features across different scales and modalities effectively, which restricts detection accuracy. In addition, the Query-Key-Value-based (QKV-based) cross-attention operations often used in existing methods help reason about the location and existence of objects by capturing non-local contexts, but they are time-consuming and increase computational complexity. To address these challenges, we present SSLFusion, a novel Scale & Space Aligned Latent Fusion Model, consisting of a scale-aligned fusion strategy (SAF), a 3D-to-2D space alignment module (SAM), and a latent cross-modal fusion module (LFM). SAF mitigates scale misalignment between modalities by aggregating features from both images and point clouds across multiple levels. SAM reduces the inter-modal gap between image and point-cloud features by incorporating 3D coordinate information into 2D image features. LFM captures cross-modal non-local contexts in the latent space without QKV-based attention operations, thus reducing computational complexity. Experiments on the KITTI and DENSE datasets demonstrate that SSLFusion outperforms state-of-the-art methods. Our approach obtains an absolute gain of 2.15% in 3D AP over the state-of-the-art method GraphAlign on the moderate level of the KITTI test set.
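The idea of incorporating 3D coordinate information into 2D image features can be illustrated with a minimal sketch. It assumes a pinhole camera with a hypothetical $3 \times 4$ projection matrix `proj` applied to points already expressed in the camera frame, nearest-pixel sampling, and plain concatenation of the coordinates; the abstract does not specify how SAM performs this step, so every name and design choice below is an assumption.

```python
import numpy as np

def project_points(points_xyz, proj):
    """Project (N, 3) camera-frame points with a 3x4 matrix; returns pixels and depth."""
    ones = np.ones((points_xyz.shape[0], 1))
    uvw = np.hstack([points_xyz, ones]) @ proj.T          # (N, 3)
    depth = uvw[:, 2]
    # Avoid dividing by zero or negative depth; such points are masked out later.
    safe_depth = np.where(depth > 0, depth, 1.0)
    uv = uvw[:, :2] / safe_depth[:, None]
    return uv, depth

def gather_with_coords(image_feats, points_xyz, proj):
    """Sample (H, W, C) image features at projected points and append xyz.

    Returns an (M, C + 3) array for the M points that fall inside the image,
    plus the boolean mask of length N that selects those points.
    """
    H, W, _ = image_feats.shape
    uv, depth = project_points(points_xyz, proj)
    cols = np.floor(uv[:, 0]).astype(int)
    rows = np.floor(uv[:, 1]).astype(int)
    valid = (depth > 0) & (cols >= 0) & (cols < W) & (rows >= 0) & (rows < H)
    sampled = image_feats[rows[valid], cols[valid]]       # nearest-pixel lookup
    # Simplest possible "coordinate embedding": concatenate raw 3D coordinates.
    return np.hstack([sampled, points_xyz[valid]]), valid
```

A real system would more likely use bilinear sampling and a learned encoding of the coordinates; concatenation is used here only to keep the sketch short.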
