SSLFusion: Scale & Space Aligned Latent Fusion Model for Multimodal 3D Object Detection

Bonan Ding, Jin Xie, Jing Nie, Jiale Cao

TL;DR

SSLFusion addresses scale and spatial misalignment in multimodal 3D object detection by combining a Scale-Aligned Fusion strategy (SAF), a 3D-to-2D Space Alignment module (SAM), and a Latent Cross-Modal Fusion module (LFM). It fuses multi-stage 2D and 3D features at each level with a 3D pyramid fusion design and embeds 3D coordinates into 2D features to narrow the inter-modal gap. Cross-modal interaction takes place in a latent space with complexity $O(N \cdot c \cdot n)$, which avoids expensive QKV-based attention. On KITTI and DENSE the method reports state-of-the-art results, including improvements of $2.76\%$ and $2.98\%$ in 3D AP on the KITTI moderate and hard levels and strong gains under adverse weather on DENSE. Ablations confirm the contribution of each component, and the design offers a practical, efficient route to cross-modal perception in autonomous systems.

Abstract

Multimodal 3D object detection based on deep neural networks has made significant progress. However, it still faces challenges due to the misalignment of scale and spatial information between features extracted from 2D images and those derived from 3D point clouds. Existing methods usually aggregate multimodal features at a single stage, even though multi-stage cross-modal features are crucial for detecting objects of various scales. As a result, these methods often struggle to integrate features across different scales and modalities effectively, which restricts detection accuracy. Additionally, the Query-Key-Value-based (QKV-based) cross-attention operations often used in existing methods help reason about the location and existence of objects by capturing non-local contexts, but they are time-consuming and increase computational complexity. To address these challenges, we present SSLFusion, a novel Scale & Space Aligned Latent Fusion Model, consisting of a scale-aligned fusion strategy (SAF), a 3D-to-2D space alignment module (SAM), and a latent cross-modal fusion module (LFM). SAF mitigates scale misalignment between modalities by aggregating features from both images and point clouds across multiple levels. SAM reduces the inter-modal gap between image and point cloud features by incorporating 3D coordinate information into 2D image features. LFM captures cross-modal non-local contexts in the latent space without QKV-based attention operations, thus reducing computational complexity. Experiments on the KITTI and DENSE datasets demonstrate that SSLFusion outperforms state-of-the-art methods. Our approach obtains an absolute gain of 2.15% in 3D AP over the state-of-the-art method GraphAlign on the moderate level of the KITTI test set.
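
The abstract does not spell out how LFM avoids QKV attention, so the following is only a generic illustration of a latent bottleneck, not the authors' module. It routes image context to the voxels through a small set of latent slots, so that every matrix product costs $O(N \cdot c \cdot n)$. The function `latent_fuse`, the projection `w_assign`, and the reading of $N$, $c$, $n$ as voxel count, channel width, and number of latent slots are all assumptions made for this sketch.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def latent_fuse(voxel_feats, image_feats, w_assign):
    """Add image context to voxel features via n latent slots (hypothetical).

    voxel_feats: (N, c) LiDAR voxel features.
    image_feats: (N, c) image features already sampled at the voxels' pixels.
    w_assign:    (c, n) assumed learned projection onto n latent slots.
    """
    logits = voxel_feats @ w_assign            # (N, n), costs O(N*c*n)
    pool = softmax(logits, axis=0)             # each slot averages over voxels
    latent = pool.T @ image_feats              # (n, c) latent image summary
    assign = softmax(logits, axis=1)           # each voxel mixes the n slots
    context = assign @ latent                  # (N, c) context per voxel
    return voxel_feats + context               # residual fusion


rng = np.random.default_rng(0)
N, c, n = 1000, 64, 16
fused = latent_fuse(rng.normal(size=(N, c)),
                    rng.normal(size=(N, c)),
                    rng.normal(size=(c, n)))
print(fused.shape)  # (1000, 64)
```

No $N \times N$ or voxel-by-pixel attention map is ever formed here; the largest intermediate is the $N \times n$ assignment matrix, which is where the saving over pairwise attention comes from when $n$ is small.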

Paper Structure

This paper contains 14 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of our fusion strategy with other methods. Feature-level fusion methods can be divided into two categories: (a) Early-Fusion methods fuse multi-scale image feature maps with the first-stage voxel features. (b) Late-Fusion methods employ depth estimation to transform multi-scale 2D features into 3D space or BEV space, then fuse them with 3D features in those spaces at a single stage. (c) In contrast, our approach fuses multi-stage, multi-scale 2D and 3D features in an aligned manner, as opposed to the single-stage fusion in categories (a) and (b).
  • Figure 2: The overall architecture of SSLFusion. Our model consists of four parts: a LiDAR branch and an image branch that extract LiDAR and image features, respectively; the proposed Latent Cross-Modal Fusion module with Space Alignment, which fuses LiDAR and image features at each stage in a space-aligned manner; and the 3D object detection head, which generates detection results from the multi-level fused features.
  • Figure 3: Illustration of aligned fusion between image features and voxels at different levels. (a) and (b) show the correspondence between the image pixels of distant objects and the first-stage voxels of the 3D backbone. Distant objects contain few voxels, and the number shrinks further as the 3D convolutions downsample. (c) and (d) show the feature attention on objects of the first-level and fourth-level image features produced by the image backbone. The level-1 image features retain foreground responses for distant objects, while the level-4 features contain only background. Fusing the stage-1 voxels with the level-4 image features would therefore introduce noisy features.
  • Figure 4: Structure of Latent Cross-Modal Fusion Module with 3D-to-2D Space Alignment.
  • Figure 5: Structure of Efficient Cross-Modal Interaction.
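
The stage-wise pairing described in the captions of Figures 1 and 3 can be sketched as follows: voxel centers from a given 3D stage are projected into the image and look up features from the image level of matching depth. This is a minimal sketch under stated assumptions, not the paper's implementation. The helper names, the single 3x4 projection matrix `proj`, the per-level `stride`, and the nearest-neighbour lookup are all hypothetical choices.

```python
import numpy as np


def project_to_image(points_xyz, proj):
    """Project (N, 3) camera-frame points with an assumed (3, 4) matrix."""
    homo = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])  # (N, 4)
    uvw = homo @ proj.T                                            # (N, 3)
    depth = uvw[:, 2]
    # Avoid dividing by non-positive depth; such points are masked later.
    safe = np.where(depth > 0, depth, 1.0)
    return uvw[:, :2] / safe[:, None], depth


def sample_level(feat_map, uv, depth, stride):
    """Nearest-neighbour lookup in an (H, W, C) feature map of given stride."""
    h, w, ch = feat_map.shape
    cols = np.floor(uv[:, 0] / stride).astype(int)
    rows = np.floor(uv[:, 1] / stride).astype(int)
    valid = (depth > 0) & (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    out = np.zeros((len(uv), ch), dtype=feat_map.dtype)
    out[valid] = feat_map[rows[valid], cols[valid]]
    return out, valid


def aligned_pairs(centers_per_stage, feats_per_level, strides, proj):
    """Pair stage i voxel centers with level i image features."""
    pairs = []
    for centers, feat_map, stride in zip(centers_per_stage,
                                         feats_per_level, strides):
        uv, depth = project_to_image(centers, proj)
        pairs.append(sample_level(feat_map, uv, depth, stride))
    return pairs
```

Voxels that project outside the image or behind the camera receive zero features and a `False` mask entry, so a downstream fusion step can skip them. Pairing stage 1 voxels with level 1 features rather than level 4 reflects the observation in Figure 3 that deep image levels lose the foreground response of distant objects.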