Dual-Domain Homogeneous Fusion with Cross-Modal Mamba and Progressive Decoder for 3D Object Detection

Xuzhong Hu, Zaipeng Duan, Pei An, Jun Zhang, Jie Ma

TL;DR

DDHFusion fuses LiDAR and camera data for robust 3D object detection by jointly exploiting sparse voxel and BEV representations. It introduces HVF/HBF with cross-modal Mamba blocks, SAFS for efficient image-to-voxel sampling, and PQG plus a Progressive Decoder with MMVFM to improve recall and localization. Empirical results on nuScenes show state-of-the-art mAP and NDS, with ablations validating the contribution of each component. The approach combines strong accuracy with efficient inference, highlighting the practicality of dual-domain homogeneous fusion for autonomous driving.

Abstract

Fusing LiDAR and image features in a homogeneous BEV domain has become popular for 3D object detection in autonomous driving. However, this paradigm is constrained by excessive feature compression. While some works explore dense voxel fusion to enable better feature interaction, they face high computational costs and challenges in query generation. Additionally, feature misalignment in both domains results in suboptimal detection accuracy. To address these limitations, we propose a Dual-Domain Homogeneous Fusion network (DDHFusion), which leverages the complementarity of the BEV and voxel domains while mitigating their drawbacks. Specifically, we first transform image features into BEV and sparse voxel representations using lift-splat-shot and our proposed Semantic-Aware Feature Sampling (SAFS) module. The latter significantly reduces computational overhead by discarding unimportant voxels. Next, we introduce Homogeneous Voxel and BEV Fusion (HVF and HBF) networks for multi-modal fusion within their respective domains. They are equipped with novel cross-modal Mamba blocks to resolve feature misalignment and enable comprehensive scene perception. The output voxel features are injected into the BEV space to compensate for the information loss caused by direct height compression. During query selection, the Progressive Query Generation (PQG) mechanism is implemented in the BEV domain to reduce false negatives caused by feature compression. Furthermore, we propose a Progressive Decoder (QD) that sequentially aggregates not only context-rich BEV features but also geometry-aware voxel features, using deformable attention and the Multi-Modal Voxel Feature Mixing (MMVFM) block for precise classification and box regression.
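
The abstract states that SAFS cuts computation by discarding unimportant voxels. A minimal sketch of that idea follows. It is an illustration under stated assumptions, not the authors' implementation: the function names `select_voxels` and `gather_sparse`, the `keep_ratio` parameter, and the choice to combine depth and semantic scores by a product are all hypothetical.

```python
import numpy as np

def select_voxels(depth_prob, sem_score, keep_ratio=0.25):
    """Hypothetical SAFS-style filter (not the paper's exact rule).

    depth_prob: (N,) depth probability in [0, 1] for each candidate voxel.
    sem_score:  (N,) foreground/semantic score in [0, 1] for the same voxel.
    Returns the sorted indices of the voxels that are kept.
    """
    # Assumption: a voxel matters only if it is likely occupied AND
    # likely foreground, so the two scores are multiplied.
    importance = depth_prob * sem_score
    k = max(1, int(round(keep_ratio * importance.size)))
    # argpartition places the k largest importances first without a full sort.
    top = np.argpartition(-importance, k - 1)[:k]
    return np.sort(top)

def gather_sparse(features, coords, keep_idx):
    """Keep only the selected voxels: features (N, C) and integer coords (N, 3)."""
    return features[keep_idx], coords[keep_idx]

if __name__ == "__main__":
    depth = np.array([0.9, 0.1, 0.8, 0.2])
    sem = np.array([0.9, 0.9, 0.1, 0.1])
    feats = np.arange(8, dtype=np.float32).reshape(4, 2)
    xyz = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1]])
    idx = select_voxels(depth, sem, keep_ratio=0.5)
    sparse_feats, sparse_xyz = gather_sparse(feats, xyz, idx)
    print(idx, sparse_feats.shape, sparse_xyz.shape)
```

Only the kept voxels carry image features into the sparse voxel domain, so the cost of the downstream fusion scales with the number of kept voxels rather than with the full dense grid.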

Paper Structure

This paper contains 18 sections, 21 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Overview of homogeneous fusion methods. The view transformation in Fig. 1(a) and Fig. 1(b) causes the loss of modality details. The dense voxel representation in Fig. 1(c) causes computational burden. Our method in Fig. 1(d) combines the advantages of sparse voxel and BEV domains.
  • Figure 2: Overview of DDHFusion. It begins by extracting features from multi-view images and LiDAR points and transforms both into the voxel and BEV domains. These features are then passed to two Mamba-based homogeneous fusion networks, which perform feature alignment and global perception. The resulting high-quality fused BEV feature $B_{out}$ is then fed into the progressive query generation module, which leverages the spatial relationship of easy queries to stimulate the generation of hard queries. Finally, all queries are passed into the progressive decoder, which abstracts dual-domain features around instances for accurate classification and box regression.
  • Figure 3: Details of semantic-aware feature sampling (SAFS). We filter the coordinates of unimportant voxels by depth and semantic scores and transform the image feature to the sparse voxel domain.
  • Figure 4: Details of Mamba-based Voxel Fusion. In the left image, for the purpose of visualization, we use a 2D illustration to represent 3D scanning patterns in the HVF. "Z*2" and "Z/2" denote doubling or halving the z indices of LiDAR voxels (blue) to align them with image voxels (yellow) or to recover them to their original coordinates. As illustrated in the right figure, we utilize the bidirectional Mamba block in both intra-modal and cross-modal voxel Mamba modules.
  • Figure 5: Details of Intra-Modal and Cross-Modal BEV Mamba. We apply the four-directional scanning to unfold the BEV feature into 1D sequences, which helps construct a comprehensive spatial relationship. In the cross-modal Mamba block, the parameters for the SS2D modules of each modality are jointly computed from the image and LiDAR BEV features, while $Z_I$ and $Z_L$ are used to adaptively modulate the weights of their respective modalities.
  • ...and 4 more figures
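
The Figure 5 caption describes unfolding the BEV feature into 1D sequences by four-directional scanning. The sketch below shows one common way to do this: row-major, column-major, and the reverse of each. The scan order and the helper names `four_direction_scan` and `four_direction_merge` are assumptions for illustration; the paper's SS2D modules may order or merge the sequences differently.

```python
import numpy as np

def four_direction_scan(bev):
    """Unfold a BEV map (C, H, W) into four 1D sequences, shape (4, C, H*W).

    Order (an assumption): row-major, column-major, then each reversed.
    """
    c, h, w = bev.shape
    row_major = bev.reshape(c, h * w)                     # left-to-right, top-to-bottom
    col_major = bev.transpose(0, 2, 1).reshape(c, h * w)  # top-to-bottom, left-to-right
    return np.stack([row_major, col_major,
                     row_major[:, ::-1], col_major[:, ::-1]])

def four_direction_merge(seqs, h, w):
    """Undo each scan order and sum the four sequences back into (C, H, W)."""
    c = seqs.shape[1]
    row_f = seqs[0].reshape(c, h, w)
    col_f = seqs[1].reshape(c, w, h).transpose(0, 2, 1)
    row_b = seqs[2][:, ::-1].reshape(c, h, w)
    col_b = seqs[3][:, ::-1].reshape(c, w, h).transpose(0, 2, 1)
    return row_f + col_f + row_b + col_b

if __name__ == "__main__":
    bev = np.arange(6).reshape(1, 2, 3)
    seqs = four_direction_scan(bev)
    print(seqs[:, 0])                        # the four scan orders of channel 0
    print(four_direction_merge(seqs, 2, 3))  # equals 4 * bev when no model is applied
```

In an actual model, each of the four sequences would pass through a state-space (Mamba) layer before merging, so every BEV cell receives context from all four scan directions.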