LXLv2: Enhanced LiDAR Excluded Lean 3D Object Detection with Fusion of 4D Radar and Camera

Weiyi Xiong, Zean Zou, Qiuchi Zhao, Fengchun He, Bing Zhu

TL;DR

LXLv2 addresses inaccurate depth estimation and fragile feature fusion in 4D radar-camera 3D object detection. It adds a camera intrinsics embedding and a one-to-many depth supervision scheme driven by radar points, in which the radar cross section (RCS) sets the size of the supervised area. It also replaces concatenation-based fusion with CSAFusion, a module that jointly applies channel and spatial attention for adaptive fusion. On the View-of-Delft (VoD) and TJ4DRadSet datasets, LXLv2 surpasses LXL in $mAP_{3D}$ and $mAP_{BEV}$ while reducing inference time, and it remains robust under challenging lighting. These gains come without relying on extra data, and the approach has potential for online continual learning.

Abstract

As the previous state-of-the-art 4D radar-camera fusion-based 3D object detection method, LXL utilizes the predicted image depth distribution maps and radar 3D occupancy grids to assist the sampling-based image view transformation. However, the depth prediction lacks accuracy and consistency, and the concatenation-based fusion in LXL impedes the model robustness. In this work, we propose LXLv2, where modifications are made to overcome the limitations and improve the performance. Specifically, considering the position error in radar measurements, we devise a one-to-many depth supervision strategy via radar points, where the radar cross section (RCS) value is further exploited to adjust the supervision area for object-level depth consistency. Additionally, a channel and spatial attention-based fusion module named CSAFusion is introduced to improve feature adaptiveness. Experimental results on the View-of-Delft and TJ4DRadSet datasets show that the proposed LXLv2 can outperform LXL in detection accuracy, inference speed and robustness, demonstrating the effectiveness of the model.
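The one-to-many depth supervision idea can be illustrated with a minimal sketch. It assumes radar points have already been projected to pixel coordinates, and it is not the authors' implementation: the function names, the linear RCS-to-radius mapping, the assumed RCS range, and the nearest-depth rule for overlapping neighborhoods are all hypothetical choices made for illustration.

```python
import numpy as np


def radius_from_rcs(rcs_dbsm, base_radius=1, max_radius=4):
    """Hypothetical mapping: a larger RCS gives a larger supervision radius (pixels).

    Assumption: RCS values lie roughly in [-10, 30] dBsm and scale linearly.
    """
    scale = np.clip((rcs_dbsm + 10.0) / 40.0, 0.0, 1.0)
    return int(round(float(base_radius + scale * (max_radius - base_radius))))


def one_to_many_depth_targets(points_uvd, rcs, height, width):
    """Build a depth target map in which each radar point supervises a pixel neighborhood.

    points_uvd: (N, 3) array of pixel column u, pixel row v, and depth d in meters.
    rcs:        (N,) array of RCS values, used only to size each neighborhood.
    Returns (target, mask): target depths and a boolean map of supervised pixels.
    """
    target = np.full((height, width), np.inf)
    for (u, v, d), r in zip(points_uvd, rcs):
        rad = radius_from_rcs(r)
        u0, u1 = max(int(u) - rad, 0), min(int(u) + rad + 1, width)
        v0, v1 = max(int(v) - rad, 0), min(int(v) + rad + 1, height)
        if u0 >= u1 or v0 >= v1:
            continue  # the neighborhood lies entirely outside the image
        # Assumption: where neighborhoods overlap, keep the nearer depth.
        patch = target[v0:v1, u0:u1]
        np.minimum(patch, d, out=patch)
    mask = np.isfinite(target)
    target[~mask] = 0.0  # unsupervised pixels carry no target
    return target, mask
```

With this sketch, a point with high RCS supervises a wider window than a point with low RCS, which is one simple way to spread a single radar depth over the pixels of the same object.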

Paper Structure

This paper contains 15 sections, 16 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The overall architecture of LXLv2 compared with LXL. Differences lie in the depth estimation process and the fusion module. During depth estimation, camera intrinsics are introduced and radar points are exploited for one-to-many depth supervision, and RCS values are utilized to determine the supervision area. In the fusion module, CSAFusion is applied for improved feature adaptiveness and model robustness.
  • Figure 2: The illustration of maximum position error of radar points.
  • Figure 3: The illustration of neighborhood and object size.
  • Figure 4: The architecture of CSAFusion.
  • Figure 5: Visualization results of LXL and LXLv2 on the VoD val set (best viewed in zoom and color). Each column corresponds to a frame of data, containing an image and radar points (gray points) in BEV, where orange boxes represent ground-truths and blue boxes stand for predicted bounding boxes. The red triangle denotes the position of the ego-vehicle.
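
As a rough illustration of what a channel and spatial attention-based fusion module can look like, the sketch below applies a generic CBAM-style gate to concatenated radar and image BEV features. It is not the authors' CSAFusion. The order of the two attention steps, the random weights, the scalar spatial mixing used in place of a convolution, and all function names are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Gate each channel of feat (C, H, W) using pooled statistics and a shared MLP."""
    avg = feat.mean(axis=(1, 2))  # (C,) average-pooled descriptor
    mx = feat.max(axis=(1, 2))    # (C,) max-pooled descriptor

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # two layers with ReLU in between

    gate = sigmoid(mlp(avg) + mlp(mx))  # (C,) values in (0, 1)
    return feat * gate[:, None, None], gate


def spatial_attention(feat, w_avg=0.5, w_max=0.5):
    """Gate each BEV location using channel-pooled maps.

    Simplification: a scalar mix replaces the convolution a real module would use.
    """
    avg = feat.mean(axis=0)  # (H, W)
    mx = feat.max(axis=0)    # (H, W)
    gate = sigmoid(w_avg * avg + w_max * mx)
    return feat * gate[None], gate


def attention_fuse(radar_bev, image_bev, rng):
    """Concatenate radar and image BEV features, then apply both gates in sequence."""
    x = np.concatenate([radar_bev, image_bev], axis=0)
    c = x.shape[0]
    # Random weights stand in for learned parameters in this sketch.
    w1 = rng.standard_normal((c // 2, c)) * 0.1
    w2 = rng.standard_normal((c, c // 2)) * 0.1
    x, channel_gate = channel_attention(x, w1, w2)
    x, spatial_gate = spatial_attention(x)
    return x, channel_gate, spatial_gate
```

Compared with plain concatenation, gating lets the fused features down-weight a modality where it is unreliable, for example image channels in poor lighting, which is the kind of adaptiveness the abstract attributes to CSAFusion.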