Robust 3D Object Detection from LiDAR-Radar Point Clouds via Cross-Modal Feature Augmentation

Jianning Deng, Gabriel Chan, Hantao Zhong, Chris Xiaoxuan Lu

TL;DR

The paper tackles robustness in 3D object detection through cross-modal learning between LiDAR and 4D radar that works in either direction. It introduces a framework with a center-aware backbone, instance feature aggregation, alignment-aware projection to a shared latent space, and selective cross-modal matching. The framework is trained in two steps and needs only single-modal input at inference. Results on the View-of-Delft (VoD) dataset show state-of-the-art performance for both radar and LiDAR detection, with notable gains on small objects and competitive runtime. The work demonstrates that cross-modal supervision is a practical way to strengthen single-sensor detectors in autonomous driving.

Abstract

This paper presents a novel framework for robust 3D object detection from point clouds via cross-modal hallucination. Our proposed approach is agnostic to either hallucination direction between LiDAR and 4D radar. We introduce multiple alignments on both spatial and feature levels to achieve simultaneous backbone refinement and hallucination generation. Specifically, spatial alignment is proposed to deal with the geometry discrepancy for better instance matching between LiDAR and radar. The feature alignment step further bridges the intrinsic attribute gap between the sensing modalities and stabilizes the training. The trained object detection models can deal with difficult detection cases better, even though only single-modal data is used as the input during the inference stage. Extensive experiments on the View-of-Delft (VoD) dataset show that our proposed method outperforms the state-of-the-art (SOTA) methods for both radar and LiDAR object detection while maintaining competitive efficiency in runtime. Code is available at https://github.com/DJNing/See_beyond_seeing.
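
The two alignment levels described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: the function names, the greedy nearest-neighbour rule, the `max_dist` radius, and the mean-squared-error loss are all assumptions chosen for illustration. It pairs instance centers from the primary and auxiliary modalities that lie close together (a stand-in for spatial alignment), then penalizes the feature difference on the matched pairs only (a stand-in for feature alignment).

```python
import math


def match_instances(primary_centers, auxiliary_centers, max_dist=1.0):
    """Greedy one-to-one nearest-neighbour matching of 3D instance centers.

    Returns (primary_index, auxiliary_index) pairs whose Euclidean distance
    is at most max_dist. Centers without a close partner are left unmatched.
    The matching rule and max_dist are illustrative assumptions.
    """
    candidates = []
    for i, p in enumerate(primary_centers):
        for j, a in enumerate(auxiliary_centers):
            d = math.dist(p, a)
            if d <= max_dist:
                candidates.append((d, i, j))
    candidates.sort()  # closest pairs are accepted first

    used_primary, used_auxiliary, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_primary and j not in used_auxiliary:
            used_primary.add(i)
            used_auxiliary.add(j)
            pairs.append((i, j))
    return pairs


def feature_alignment_loss(primary_feats, auxiliary_feats, pairs):
    """Mean squared error over matched feature vectors (0.0 if no pairs)."""
    if not pairs:
        return 0.0
    total, count = 0.0, 0
    for i, j in pairs:
        for x, y in zip(primary_feats[i], auxiliary_feats[j]):
            total += (x - y) ** 2
            count += 1
    return total / count


# Hypothetical data: three radar instances, two LiDAR instances.
radar_centers = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (50.0, 50.0, 0.0)]
lidar_centers = [(0.2, 0.0, 0.0), (10.5, 0.0, 0.0)]
radar_feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
lidar_feats = [[1.0, 2.0], [3.0, 6.0]]

pairs = match_instances(radar_centers, lidar_centers)
loss = feature_alignment_loss(radar_feats, lidar_feats, pairs)
```

In this toy setup the third radar instance has no LiDAR counterpart within the radius, so it contributes nothing to the loss. This mirrors the idea that only instances visible to both sensors should drive the cross-modal supervision.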

Paper Structure (15 sections, 6 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Fig. \ref{fig:proposed_method} illustrates the proposed method with 4D radar input as an example. Fig. \ref{fig:open_fig_ctp} to Fig. \ref{fig:open_fig_ours} visualize the radar detections of different methods in the same scene. Ground-truth boxes are shown in red, false detections in yellow, and correct detections in green. RGB images are used only for visualization.
  • Figure 2: Method overview. The upper figure illustrates the two-step training strategy: blocks used in the first training step are connected with green lines, and those used in the second step are connected with orange lines. Note that the primary and auxiliary data are interchangeable between the two sensor modalities (radar and LiDAR), depending on the end goal. Only single-modal data (the primary modality) is used during inference, as shown in the lower figure, where blocks are connected with blue lines.
  • Figure 3: Because the point clouds are sparse, point-level matches are difficult to obtain without centroid generation in the instance feature aggregation module. The black bounding boxes are the ground-truth object labels. (Best viewed in color and zoomed in.)
  • Figure 4: Illustration and visualization of selective matching during training. In Fig. \ref{fig:training_vis}, all matched points (cyan) lie near the bounding box centers of co-visible objects, which demonstrates the effectiveness of the instance feature aggregation module. Wrongly sampled center points (magenta) are moved to random positions so that they do not interfere with training.
  • Figure 5: Qualitative results of our method. Ground-truth bounding boxes are shown in red and predictions in green. The left images and the middle figures show the radar detection results. Note that the RGB images are only for visualization and are not used in model training or inference. The right figures show the LiDAR point clouds and the corresponding predictions.
  • ...and 1 more figure
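
The handling of wrongly sampled center points described in the Figure 4 caption can be sketched as follows. This is a simplified illustration, not the paper's code. It assumes that a center counts as "wrongly sampled" when it falls outside every ground-truth box, that boxes are axis-aligned (yaw is ignored), and that "random positions" means positions drawn far outside the scene. The names `inside_box` and `scatter_wrong_centers` and the `offset` and `spread` parameters are hypothetical.

```python
import random


def inside_box(point, box):
    """True if point lies in an axis-aligned box (cx, cy, cz, dx, dy, dz).

    Ignoring box yaw is a simplifying assumption of this sketch.
    """
    return all(abs(point[k] - box[k]) <= box[k + 3] / 2 for k in range(3))


def scatter_wrong_centers(centers, gt_boxes, rng, offset=1000.0, spread=100.0):
    """Move centers that fall outside every ground-truth box far away.

    Returns (new_centers, keep_mask). Relocated centers are drawn uniformly
    from [offset, offset + spread] on each axis, so a distance-based matcher
    with a small radius can no longer pair them with real instances.
    """
    new_centers, keep_mask = [], []
    for center in centers:
        valid = any(inside_box(center, box) for box in gt_boxes)
        keep_mask.append(valid)
        if valid:
            new_centers.append(tuple(center))
        else:
            new_centers.append(
                tuple(offset + rng.uniform(0.0, spread) for _ in range(3))
            )
    return new_centers, keep_mask


# Hypothetical data: one ground-truth box and two predicted centers.
gt_boxes = [(0.0, 0.0, 0.0, 2.0, 2.0, 2.0)]
centers = [(0.1, 0.0, 0.0), (5.0, 5.0, 0.0)]
moved, mask = scatter_wrong_centers(centers, gt_boxes, random.Random(0))
```

Relocating invalid centers instead of deleting them keeps the number of centers fixed, which is convenient when downstream code expects fixed-size batches. Whether the paper relies on this property is not stated in the caption.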