RadarXFormer: Robust Object Detection via Cross-Dimension Fusion of 4D Radar Spectra and Images for Autonomous Driving

Yue Sun, Yeqiang Qian, Zhe Wang, Tianhui Li, Chunxiang Wang, Ming Yang

Abstract

Reliable perception is essential for autonomous driving systems to operate safely under diverse real-world traffic conditions. However, camera- and LiDAR-based perception systems suffer from performance degradation under adverse weather and lighting conditions, limiting their robustness and large-scale deployment in intelligent transportation systems. Radar-vision fusion provides a promising alternative by combining the environmental robustness and cost efficiency of millimeter-wave (mmWave) radar with the rich semantic information captured by cameras. Nevertheless, conventional 3D radar measurements lack height resolution and remain highly sparse, while emerging 4D mmWave radar introduces elevation information but also brings challenges such as signal noise and large data volume. To address these issues, this paper proposes RadarXFormer, a 3D object detection framework that enables efficient cross-modal fusion between 4D radar spectra and RGB images. Instead of relying on sparse radar point clouds, RadarXFormer directly leverages raw radar spectra and constructs an efficient 3D representation that reduces data volume while preserving complete 3D spatial information. The "X" highlights the proposed cross-dimension (3D-2D) fusion mechanism, in which multi-scale 3D spherical radar feature cubes are fused with complementary 2D image feature maps. Experiments on the K-Radar dataset demonstrate improved detection accuracy and robustness under challenging conditions while maintaining real-time inference capability.

Paper Structure (25 sections, 11 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: The overall architecture of RadarXFormer, where multi-scale 3D radar and 2D image features are extracted by the encoders in the Feature Extraction module, fused with 3D object queries through multi-scale deformable attention in the Cross-Dimension Feature Fusion module, and further refined via iterative query and detection processes for accurate 3D object estimation.
  • Figure 2: Visualization of an example from the K-Radar dataset showing the raw 4D radar spectrum and the proposed 3D radar representation. The symbol ⓒ indicates channel-wise concatenation.
  • Figure 3: Visualization of the same radar spectrum in BEV under different coordinates and processing stages. (a) Radar spectrum in Cartesian coordinates without interpolation; (b) Cartesian spectrum with interpolation; (c) raw RA map in spherical coordinates; (d) RA map after coarse filtering; (e) RA map after coarse filtering and CFAR. Pink boxes indicate ground-truth objects.
  • Figure 4: Design details of the Cross-Dimension Feature Fusion module. MHSA denotes multi-head self-attention, while MSDA denotes (multi-head) multi-scale deformable attention.
  • Figure 5: Example results of radar–camera 3D object detection models under various weather conditions. Ground truth and predictions are shown in pink and green, respectively. EA and RA maps are displayed on the left.
  • ...and 1 more figure
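
The Figure 3 caption mentions a CFAR (constant false alarm rate) stage applied to the RA map, but the captions do not say which CFAR variant or which parameters are used. As a rough illustration only, the sketch below implements the common cell-averaging variant (CA-CFAR) along a single 1D power profile. The function name `ca_cfar_1d` and the parameters `num_train`, `num_guard`, and `scale` are hypothetical and are not taken from the paper.

```python
def ca_cfar_1d(power, num_train=4, num_guard=1, scale=3.0):
    """Cell-averaging CFAR over a 1D list of power values.

    Illustrative sketch only: the window sizes and the threshold scale
    are assumed values, not the settings used by RadarXFormer.
    Returns the indices of cells whose power exceeds `scale` times the
    mean power of the surrounding training cells.
    """
    detections = []
    half = num_train + num_guard
    # Skip the edges, where a full training window is not available.
    for i in range(half, len(power) - half):
        # Training cells on each side, excluding the guard cells and
        # the cell under test itself.
        leading = power[i - half : i - num_guard]
        trailing = power[i + num_guard + 1 : i + half + 1]
        noise = (sum(leading) + sum(trailing)) / (len(leading) + len(trailing))
        if power[i] > scale * noise:
            detections.append(i)
    return detections


# Flat noise floor with a single strong return at index 10.
profile = [1.0] * 20
profile[10] = 20.0
print(ca_cfar_1d(profile))
```

Because the threshold is derived from the neighbouring cells rather than fixed globally, it adapts to the local noise level. A 2D RA map would need the same idea applied with a 2D training window.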