Revisiting Radar Camera Alignment by Contrastive Learning for 3D Object Detection

Linhua Kong, Dongxia Chang, Lian Liu, Zisen Kong, Pengyuan Li, Yao Zhao

TL;DR

This work targets radar-camera fusion for 3D object detection in autonomous driving. It introduces RCAlign, a framework built around two modules: Dual-Route Alignment (DRA), which uses contrastive learning to align and fuse radar and camera features, and Radar Feature Enhancement (RFE), which densifies sparse radar BEV features under a knowledge distillation loss. RCAlign reports state-of-the-art radar-camera fusion results on the nuScenes benchmark, with gains of 4.3% NDS and 8.4% mAP over RCBEVDet in real-time 3D detection, and the paper includes qualitative results for day, night, and rain scenes.

Abstract

Recently, 3D object detection algorithms based on radar and camera fusion have shown excellent performance, setting the stage for their application in autonomous driving perception tasks. Existing methods have focused on the feature misalignment caused by the domain gap between radar and camera. However, they either neglect inter-modal feature interaction during alignment or fail to effectively align features at the same spatial location across modalities. To alleviate these problems, we propose a new alignment model called Radar Camera Alignment (RCAlign). Specifically, we design a Dual-Route Alignment (DRA) module based on contrastive learning to align and fuse the features of radar and camera. Moreover, considering the sparsity of radar BEV features, we propose a Radar Feature Enhancement (RFE) module that densifies radar BEV features under a knowledge distillation loss. Experiments show that RCAlign achieves a new state of the art on the public nuScenes benchmark in radar-camera fusion for 3D object detection. Furthermore, RCAlign achieves a significant performance gain (4.3% NDS and 8.4% mAP) in real-time 3D detection compared to the latest state-of-the-art method (RCBEVDet).
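
To make the idea of contrastive alignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes a symmetric InfoNCE-style loss over cosine similarities, in which the i-th query updated with radar features should match the i-th query updated with camera features and differ from all others. The function names, the temperature of 0.1, and the symmetric form are all assumptions made for illustration; the abstract does not state the exact loss. Plain Python lists are used so the sketch runs without dependencies.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i], not positives[j != i].

    The temperature value is an assumption, not taken from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_alignment_loss(radar_queries, camera_queries, temperature=0.1):
    """Average of the radar-to-camera and camera-to-radar InfoNCE terms (hypothetical helper)."""
    return 0.5 * (
        info_nce(radar_queries, camera_queries, temperature)
        + info_nce(camera_queries, radar_queries, temperature)
    )


# Toy example: two queries with 2-D features from each modality.
radar_queries = [[1.0, 0.0], [0.0, 1.0]]
camera_queries = [[0.9, 0.1], [0.1, 0.9]]
aligned = symmetric_alignment_loss(radar_queries, camera_queries)
misaligned = symmetric_alignment_loss(radar_queries, camera_queries[::-1])
```

In this toy example the loss is small when matching queries point in similar directions and large when the pairing is swapped, which is the behaviour an alignment objective of this kind relies on.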

Paper Structure

This paper contains 18 sections, 6 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Different alignment methods: Dense BEV Alignment (a), Sparse BEV Alignment (b) and Dual-Route Alignment (c, ours). Q, V, BEV, PV and CL indicate Query, Value, Bird's-Eye-View, Perception-View and Contrastive Loss, respectively. The semi-transparent part and the fully solid part form their own systems.
  • Figure 2: The overall architecture of RCAlign. Multi-view images and radar points are fed into the backbones to extract modality-specific features. The radar head then serves as an auxiliary task, predicting both the centres of the 3D boxes and the radar heatmaps. After that, the designed dual-route alignment utilizes sparse queries to align and fuse the features of the two modalities. Finally, the occupancy features, obtained by projecting the centres of the 3D boxes onto the BEV grid, are input together with the radar features into the RFE module to enhance the radar BEV features. Sparse queries are composed of radar queries (red), initial queries (white) and temporal queries (yellow).
  • Figure 3: The proposed Dual-Route Alignment (DRA) module (a) and Radar Feature Enhancement (RFE) module (b). The DRA first utilizes sparse queries to aggregate radar BEV features and image PV features through two separate paths. Then, the updated queries of the two paths are aligned by a contrastive loss. Finally, the fusion queries are obtained by element-wise addition of the updated sparse queries from the two paths. For RFE, the occupancy features and the radar BEV features are concatenated and passed through a three-layer conv block to obtain dense radar features. Subsequently, a knowledge distillation (KD) loss is employed to enhance the original radar features.
  • Figure 4: Visualisation results of RCAlign. The red and blue boxes indicate ground truth (GT) and predictions, respectively. The orange and green circles indicate examples of successful and failed predictions in densely populated scenes, respectively.
  • Figure 5: More visualisation results. The first two rows show daytime scenes, the middle two rows show night scenes, and the last two rows show rainy scenes.
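
The Figure 2 caption describes occupancy features obtained by projecting the centres of the 3D boxes onto the BEV grid. As a rough illustration only, the sketch below marks the BEV cell containing each predicted centre. The function name, the square grid covering ±51.2 m, the 0.8 m cell size, and the binary encoding are all assumptions; the captions do not specify how the occupancy features are actually encoded.

```python
def centres_to_occupancy(centres_xy, bev_range=51.2, cell_size=0.8):
    """Mark each BEV cell that contains a box centre (x, y in metres, ego frame).

    bev_range and cell_size are illustrative values, not taken from the paper.
    Returns a square grid (list of rows) with 1.0 in occupied cells, 0.0 elsewhere.
    """
    size = int(round(2 * bev_range / cell_size))
    grid = [[0.0] * size for _ in range(size)]
    for x, y in centres_xy:
        # Shift coordinates so the grid origin sits at (-bev_range, -bev_range).
        col = int((x + bev_range) // cell_size)
        row = int((y + bev_range) // cell_size)
        # Centres outside the BEV range are ignored.
        if 0 <= row < size and 0 <= col < size:
            grid[row][col] = 1.0
    return grid


# Two hypothetical predicted centres: one inside the grid, one out of range.
occupancy = centres_to_occupancy([(10.0, -20.3), (100.0, 0.0)])
occupied_cells = sum(sum(row) for row in occupancy)
```

A grid of this kind could then be concatenated with the radar BEV features along the channel dimension, as the Figure 3 caption describes for the RFE input.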