CR3DT: Camera-RADAR Fusion for 3D Detection and Tracking

Nicolas Baumann, Michael Baumgartner, Edoardo Ghignone, Jonas Kühne, Tobias Fischer, Yung-Hsu Yang, Marc Pollefeys, Michele Magno

TL;DR

This work tackles the gap between high-performance LiDAR perception and cost-effective camera-based systems for online 3D detection and tracking. It introduces CR3DT, a camera-RADAR fusion model built on BEVDet that leverages RADAR-derived spatial and velocity information and employs a velocity-aware CC-3DT++ tracker for data association. On nuScenes, the approach improves mAP by 5.3 percentage points and AMOTA by 14.9 percentage points, while maintaining online operation with a modest latency overhead. The combination of cost-effective multimodal fusion and velocity-informed tracking presents a practical path toward robust, affordable autonomous driving perception.

Abstract

Accurate detection and tracking of surrounding objects is essential to enable self-driving vehicles. While Light Detection and Ranging (LiDAR) sensors have set the benchmark for high-performance systems, the appeal of camera-only solutions lies in their cost-effectiveness. Notably, despite the prevalent use of Radio Detection and Ranging (RADAR) sensors in automotive systems, their potential in 3D detection and tracking has been largely disregarded due to data sparsity and measurement noise. As a recent development, the combination of RADARs and cameras is emerging as a promising solution. This paper presents Camera-RADAR 3D Detection and Tracking (CR3DT), a camera-RADAR fusion model for 3D object detection and Multi-Object Tracking (MOT). Building upon the State-of-the-Art (SotA) camera-only BEVDet architecture, CR3DT demonstrates substantial improvements in both detection and tracking by incorporating the spatial and velocity information of the RADAR sensor. Experimental results on the nuScenes dataset demonstrate an absolute improvement of 5.3 percentage points in mean Average Precision (mAP) and of 14.9 percentage points in Average Multi-Object Tracking Accuracy (AMOTA) when leveraging both modalities. CR3DT bridges the gap between high-performance and cost-effective perception systems in autonomous driving by capitalizing on the ubiquitous presence of RADAR in automotive applications. The code is available at: https://github.com/ETH-PBL/CR3DT.

Paper Structure

This paper contains 19 sections, 3 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Architectural overview. The model presented in this work extends BEVDet. Detection and tracking contributions are highlighted in light blue and green, respectively. Model inputs and outputs are highlighted in bold. The input from the six camera views consists of RGB images with a resolution of $704 \times 256$, and the RADAR input is encoded in a $128 \times 128$ Bird's-Eye View (BEV) grid in which each grid cell contains 18 features. The image stream is processed through a ResNet-50 backbone and then projected into BEV space by a view transformer, while the RADAR stream is encoded in pillars with the feature encodings detailed in the sensor fusion subsection. The two intermediate outputs are then concatenated and processed through a ResNet BEV encoder, as in BEVDet. After this step, the rasterized pillars are concatenated once more through a residual connection. The generated BEV features are then passed through a CenterPoint detection head, which generates the detection result. The output bounding boxes are used to select and extract appearance embeddings, as in CC-3DT and QD-3DT. Finally, these embeddings are used in the Data Association (DA) step, which generates the tracking results by leveraging a refined velocity estimate together with the velocity output from the detection module. This estimate is obtained with a Kalman Filter (KF), as in the KF3D setting of CC-3DT: a two-stage state estimation that consists of a Prediction Step (P) and an Updating Step (U). Green dotted lines represent data stored across timesteps.
  • Figure 2: A visualization of the RADAR point cloud aggregated into BEV space for the fusion operation. The LiDAR point cloud is shown for visualization purposes only. The BEV space spans ±51.2 m along each axis with a resolution of 0.8 m, resulting in a $128 \times 128$ feature space.
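
The rasterization of RADAR points into the BEV grid described in the two captions above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the grid covers [-51.2 m, 51.2 m] along each axis (the extent implied by 128 cells at 0.8 m), treats the 18 per-cell features as opaque values, maps rows to y and columns to x, and lets the last point written to a cell win. The function name `rasterize_radar` and the constants are hypothetical.

```python
import numpy as np

# Assumed grid parameters: 128 cells * 0.8 m = 102.4 m, i.e. [-51.2, 51.2] m.
GRID_MIN = -51.2    # metres, lower bound along x and y (assumption)
CELL_SIZE = 0.8     # metres per cell
GRID_CELLS = 128    # cells per axis
NUM_FEATURES = 18   # features per cell, as stated in the Figure 1 caption


def rasterize_radar(points_xy: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Scatter RADAR points into a (128, 128, 18) BEV feature grid.

    points_xy: (N, 2) array of x, y positions in metres (ego frame).
    features:  (N, 18) array of per-point features.
    Points outside the grid are dropped. If several points fall into the
    same cell, one of them overwrites the others (a simplification).
    """
    grid = np.zeros((GRID_CELLS, GRID_CELLS, NUM_FEATURES), dtype=np.float32)

    # Convert metric coordinates to integer cell indices.
    idx = np.floor((points_xy - GRID_MIN) / CELL_SIZE).astype(np.int64)

    # Keep only points whose x and y indices both lie inside the grid.
    inside = np.all((idx >= 0) & (idx < GRID_CELLS), axis=1)
    idx = idx[inside]

    # Row index = y cell, column index = x cell (an assumed convention).
    grid[idx[:, 1], idx[:, 0]] = features[inside]
    return grid


if __name__ == "__main__":
    pts = np.array([[0.1, 0.1], [-51.0, 50.0], [60.0, 0.0]])
    feats = np.arange(3 * NUM_FEATURES, dtype=np.float32).reshape(3, NUM_FEATURES)
    bev = rasterize_radar(pts, feats)
    print(bev.shape)  # (128, 128, 18)
```

A dense grid of this shape can be concatenated channel-wise with camera BEV features of the same spatial size, which is the role the rasterized pillars play in the Figure 1 description.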