MSSF: A 4D Radar and Camera Fusion Framework With Multi-Stage Sampling for 3D Object Detection in Autonomous Driving

Hongsi Liu, Jun Liu, Guangfeng Jiang, Xin Jin

TL;DR

This paper tackles 3D object detection from sparse and noisy 4D radar point clouds by fusing radar with camera information. It introduces MSSF, a voxel-image fusion backbone that samples image features at multiple stages through two types of fusion block (SFF and MSDFF), together with a semantic-guided head that mitigates feature blurring and exploits image semantics. MSSF improves 3D mAP over state-of-the-art radar-camera fusion methods by 7.0% on VoD and 4.0% on TJ4DRadSet, and even surpasses some LiDAR-based approaches on VoD, while remaining plug-and-play with existing 3D detectors. The work demonstrates robust performance across challenging driving scenarios and lighting conditions, highlighting its practical potential for low-cost autonomous driving perception.

Abstract

As one of the automotive sensors that have emerged in recent years, 4D millimeter-wave radar has a higher resolution than conventional 3D radar and provides precise elevation measurements. However, its point clouds are still sparse and noisy, making it challenging to meet the requirements of autonomous driving. The camera, as another commonly used sensor, can capture rich semantic information. As a result, the fusion of 4D radar and camera can provide an affordable and robust perception solution for autonomous driving systems. However, previous radar-camera fusion methods have not yet been thoroughly investigated, resulting in a large performance gap compared to LiDAR-based methods. Specifically, they ignore the feature-blurring problem and do not deeply interact with image semantic information. To this end, we present a simple but effective multi-stage sampling fusion (MSSF) network based on 4D radar and camera. On the one hand, we design a fusion block that lets point cloud features interact deeply with image features and that can be applied to commonly used single-modal backbones in a plug-and-play manner. The fusion block comes in two types, namely simple feature fusion (SFF) and multiscale deformable feature fusion (MSDFF). The SFF is easy to implement, while the MSDFF has stronger fusion abilities. On the other hand, we propose a semantic-guided head that performs foreground-background segmentation on voxels and re-weights the voxel features, further alleviating the problem of feature blurring. Extensive experiments on the View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate the effectiveness of our MSSF. Notably, compared to state-of-the-art methods, MSSF achieves a 7.0% and 4.0% improvement in 3D mean average precision on the VoD and TJ4DRadSet datasets, respectively. It even surpasses classical LiDAR-based methods on the VoD dataset.
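The voxel feature re-weighting performed by the semantic-guided head can be illustrated with a minimal sketch. The abstract only states that voxels are segmented into foreground and background and that their features are re-weighted; the sigmoid score and the plain multiplicative weighting below are assumptions made for illustration, and the names `reweight_voxels` and `fg_logits` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def reweight_voxels(features, fg_logits):
    """Scale each non-empty voxel's feature vector by its foreground score.

    features:  list of per-voxel feature vectors (lists of floats)
    fg_logits: list of per-voxel foreground logits (hypothetical head output)

    Assumption: score = sigmoid(logit), applied multiplicatively.
    """
    if len(features) != len(fg_logits):
        raise ValueError("one logit is required per voxel")
    scores = [sigmoid(z) for z in fg_logits]
    weighted = [[s * v for v in f] for f, s in zip(features, scores)]
    return weighted, scores


if __name__ == "__main__":
    feats = [[2.0, 4.0], [1.0, 1.0]]
    logits = [0.0, -20.0]  # second voxel is confidently background
    weighted, scores = reweight_voxels(feats, logits)
    print(scores)
    print(weighted)
```

Under this assumed form, voxels that the head scores as background contribute features close to zero, which is one way in which blurred image features attached to background voxels could be suppressed.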

Paper Structure

This paper contains 45 sections, 16 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: An explanation of the feature-blurring problem. (a) elucidates the definitions of 2D foreground points, 3D foreground points, and 3D blurred points. (b) and (c) show the radar points and LiDAR points projected onto the image, respectively. The blue mask is the instance segmentation generated by SAM (Kirillov et al., 2023), the green points represent the 3D foreground points, and the red points represent the 3D blurred points. (d) illustrates quantitatively by averaging the ratio of the number of 3D foreground points to the number of 2D foreground points over around 200 instance masks for each class. "# 3D fore. pts" represents the number of 3D foreground points, and "# 2D fore. pts" represents the number of 2D foreground points.
  • Figure 2: The overall architecture of our MSSF. The image branch extracts features from images to obtain multi-scale feature maps. The voxel-image fusion backbone contains $n$ fusion blocks and $m$ ordinary blocks (abbreviated as "Block" in the figure), which absorb features from the image feature maps in multiple stages through our proposed fusion blocks. The non-empty voxel features of the last layer fusion block are fed into the semantic-guided head for foreground and background prediction, and the segmentation scores are utilized to weight the voxel features. The multi-scale features output by the last few blocks are passed through the 3D neck to obtain the fused BEV feature map which is sent to the detection head to obtain the final detection results.
  • Figure 3: Two types of fusion blocks. (a) shows the fusion block based on the SFF. The sparse tensor $\mathcal{X}_{n}$ with spatial shape $(D^{V}_{n}, H^{V}_{n}, W^{V}_{n})$ output by the previous block is fed to a sparse convolution layer with stride 2. The centroid of each non-empty voxel (use the center instead if the centroid is not available) is projected onto the image. Image features are then sampled from the multi-scale image feature maps through bilinear interpolation. After concatenation and mapping, the sampled feature $\mathbf{f}_{img}$ is fused with the voxel feature $\mathbf{f}_{vox}$ through $\mathcal{F}_{VI}$, obtaining $\mathbf{f}_{fuse}$. The final output is obtained after several residual blocks. (b) shows the fusion block based on the MSDFF. Unlike (a), for a non-empty voxel, the corresponding query $\mathbf{q}$ is first passed through two parallel linear layers to obtain sampling offsets and weights. Image features are sampled from the multi-scale image feature maps according to the sampling offsets. After the weighted summation and fusion operator, the fused feature $\mathbf{f}_{fuse}$ is obtained. Other processes are consistent with (a).
  • Figure 4: The pillar version of the proposed method.
  • Figure 5: Visualization results on the VoD dataset (best viewed in color and zoom). Each row represents a frame. The first column shows the image, where the orange boxes represent ground truth. The second column shows the detection results of MSSF-PP from the BEV perspective, where the green points are radar points, the red crosses represent the self-vehicle position, the orange boxes represent ground truth bounding boxes and the cyan boxes represent predicted bounding boxes. The third column is the detection results of the single-modal version MSSF-PP-R with the same meaning as the second column. The fourth column shows the visualization results of the segmentation scores output by the semantic-guided head under BEV (darker colors indicate higher scores).
  • ...and 3 more figures
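
The SFF sampling path described in the Figure 3(a) caption (project a voxel centroid onto the image, sample the multi-scale feature maps by bilinear interpolation, concatenate, then fuse with the voxel feature) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the point is assumed to be already in the camera frame with pinhole intrinsics `fx, fy, cx, cy`, pixel coordinates are mapped to each feature map by dividing by its stride, the learned mapping layer is omitted, and the fusion operator $\mathcal{F}_{VI}$ is replaced by simple concatenation. All function names are hypothetical.

```python
import math


def project_to_image(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera frame) to pixel coords (u, v)."""
    x, y, z = point_cam
    if z <= 0:  # behind the camera: no valid projection
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def bilinear_sample(feature_map, u, v):
    """Bilinearly interpolate feature_map[row][col] (a list of channels) at (u, v)."""
    h, w = len(feature_map), len(feature_map[0])
    if not (0 <= u <= w - 1 and 0 <= v <= h - 1):
        return None  # outside the feature map
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = u - x0, v - y0
    out = []
    for c in range(len(feature_map[0][0])):
        top = (1 - ax) * feature_map[y0][x0][c] + ax * feature_map[y0][x1][c]
        bottom = (1 - ax) * feature_map[y1][x0][c] + ax * feature_map[y1][x1][c]
        out.append((1 - ay) * top + ay * bottom)
    return out


def sample_multiscale(feature_maps, strides, u, v):
    """Sample every scale at the same image location and concatenate the results.

    Assumption: feature-map coordinates are pixel coordinates divided by stride.
    """
    sampled = []
    for fmap, stride in zip(feature_maps, strides):
        feat = bilinear_sample(fmap, u / stride, v / stride)
        if feat is None:
            return None
        sampled.extend(feat)
    return sampled


def fuse(f_vox, f_img):
    """Stand-in for the fusion operator F_VI: plain concatenation (assumption)."""
    return list(f_vox) + list(f_img)


if __name__ == "__main__":
    fine = [[[0.0], [1.0]], [[2.0], [3.0]]]     # stride 1, one channel
    coarse = [[[0.0], [4.0]], [[8.0], [12.0]]]  # stride 2, one channel
    uv = project_to_image((1.0, 1.0, 4.0), fx=2.0, fy=2.0, cx=0.0, cy=0.0)
    f_img = sample_multiscale([fine, coarse], [1, 2], *uv)
    print(uv, f_img, fuse([9.0], f_img))
```

The MSDFF block in Figure 3(b) differs in that the sampling locations are shifted by predicted offsets and the sampled features are combined by predicted weights; that learned part is not reproduced here.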