RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection

Yunfei Long, Abhinav Kumar, Xiaoming Liu, Daniel Morris

TL;DR

This work addresses a core challenge in camera-radar 3D detection, fusing depth and pose information across the two sensors, by explicitly modeling radar hit distributions conditioned on object properties. The authors introduce RICCARDO, a three-stage pipeline: Stage 1 predicts an object-centered radar hit distribution (RIC) in BEV; Stage 2 convolves this distribution with accumulated radar points to yield matching scores along the radial direction; Stage 3 refines candidates by combining monocular cues with Stage-2 evidence to produce a final range estimate and score. The approach achieves state-of-the-art radar-camera fusion performance on nuScenes, with improved range estimation across object categories, and ablations support the value of learned radar distributions over baselines. The method is lightweight and modular and can be paired with different monocular detectors, which suggests practical value for improving depth and pose estimation in autonomous driving with low-cost sensors.

Abstract

Radar hits reflect from points both on the boundary of and internal to object outlines. This results in a complex distribution of radar hits that depends on factors including object category, size, and orientation. Current radar-camera fusion methods implicitly account for this with a black-box neural network. In this paper, we explicitly utilize a radar hit distribution model to assist fusion. First, we build a model to predict radar hit distributions conditioned on object properties obtained from a monocular detector. Second, we use the predicted distribution as a kernel to match actual measured radar points in the neighborhood of the monocular detections, generating matching scores at nearby positions. Finally, a fusion stage combines context with the kernel detector to refine the matching scores. Our method achieves state-of-the-art radar-camera detection performance on nuScenes. Our source code is available at https://github.com/longyunf/riccardo.
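The kernel-matching step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `matching_scores`, the kernel layout (a grid indexed by radial and tangential offset from the object center), the bin size, and the toy numbers are all assumptions made for illustration. The idea it shows is only the one stated above: slide the predicted hit distribution along the ray from ego to the monocular detection and, at each candidate range, sum the predicted density at the positions of the measured radar points.

```python
import math

def matching_scores(kernel, cell, radar_xy, azimuth, ranges):
    """Score candidate ranges by summing kernel density at each radar hit.

    kernel   : 2D list; kernel[i][j] is the predicted hit density at radial
               bin i and tangential bin j, centered on the object
               (hypothetical layout, assumed for this sketch).
    cell     : bin size in meters.
    radar_xy : list of (x, y) radar hits in ego BEV coordinates.
    azimuth  : direction of the ray from ego to the detection, in radians.
    ranges   : candidate object-center ranges along that ray, in meters.
    """
    n_r, n_t = len(kernel), len(kernel[0])
    ux, uy = math.cos(azimuth), math.sin(azimuth)
    scores = []
    for r_c in ranges:
        total = 0.0
        for x, y in radar_xy:
            d_r = (x * ux + y * uy) - r_c  # radial offset from candidate center
            d_t = -x * uy + y * ux         # tangential offset from the ray
            i = math.floor(d_r / cell + n_r / 2)
            j = math.floor(d_t / cell + n_t / 2)
            if 0 <= i < n_r and 0 <= j < n_t:
                total += kernel[i][j]
        scores.append(total)
    return scores

# Toy kernel: density concentrated on the face of the object nearest the radar.
kernel = [
    [0.1, 0.6, 0.1],
    [0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
radar_xy = [(18.2, 0.1), (18.4, -0.2)]  # made-up radar hits
ranges = [18.0, 19.0, 20.0, 21.0]       # candidate ranges around a monocular guess
scores = matching_scores(kernel, 1.0, radar_xy, 0.0, ranges)
best_range = ranges[scores.index(max(scores))]
```

In this toy setup the hits sit near 18 m, but the kernel expects hits on the near face of the object, so the best-scoring center lies farther out than the hits themselves. The paper's Stage 3 then refines such scores with monocular context; that step is not sketched here.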

Paper Structure

This paper contains 24 sections, 7 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Given (a) a monocular detection, we estimate (b) the radar point distribution relative to its bounding box in BEV; we then shift the distribution and convolve it with (c) the actual radar measurements in the neighborhood to compute (d) similarity scores and estimate an updated position where the matching score is maximum. In (c) the monocular bounding box (in magenta) is misaligned with the radar points; the updated position (in orange) with the peak matching score shifts the box to a farther range so that the relative positions of the radar points match the predicted distribution (radar hits concentrated at the head of the vehicle instead of in the middle).
  • Figure 1: Stage-1 Network Structure. The class input is in one-hot encoding; $z$ represents heights of bounding box bottom faces; $\theta_\text{AZ}$ stands for azimuths of objects in ego coordinates; $\theta_\text{Y}$ and $\theta_\text{Y}^\prime$ are object yaws in ego coordinates and relative yaws (i.e., $\theta_\text{Y} - \theta_\text{AZ}$), respectively. "C" represents concatenation, and "Linear" denotes a linear transformation layer. Feature sizes are marked beside network layers.
  • Figure 2: RICCARDO inference. RICCARDO leverages a monocular detector to identify objects and estimate their attributes (category, size, orientation, and approximate range) and involves three stages. Its Stage 1 then predicts the radar hit distribution (RIC) for each object. Stage 2 bins and convolves the observed accumulated radar returns with the RIC, to generate a matching score over range. A final Stage 3 fusion refines these scores to yield a precise target range estimate.
  • Figure 2: Stage-3 Network Structure. The inputs $v_\text{x}$ and $v_\text{y}$ are monocular estimated object velocities in ego coordinates; $v_\text{R}$ and $v_\text{T}$ are monocular velocities in radial and tangential directions, respectively; $S_\text{CAM}$ and $S_\text{STG2}$ represent monocular detection scores and Stage-2 matching scores, respectively.
  • Figure 3: Stage-1 RIC Model Training. The radial ray from ego to target is plotted as dashed line for reference.
  • ...and 6 more figures
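
The angle and velocity inputs named in the Stage-1 and Stage-3 network captions above can be derived from ego-frame quantities with a few lines of trigonometry. The sketch below is an illustration only: the relative yaw follows the caption's definition $\theta_\text{Y} - \theta_\text{AZ}$, while the wrapping to $[-\pi, \pi)$, the sign convention for the tangential component (positive counter-clockwise), and all function names are assumptions not stated in the captions.

```python
import math

def wrap_angle(angle):
    """Wrap an angle in radians to [-pi, pi). The wrapping is an assumption."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def relative_yaw(yaw, azimuth):
    """Relative yaw theta_Y' = theta_Y - theta_AZ, as defined in the caption."""
    return wrap_angle(yaw - azimuth)

def radial_tangential(vx, vy, azimuth):
    """Project an ego-frame velocity (vx, vy) onto the ray toward the object.

    Returns (v_R, v_T). v_T is taken positive counter-clockwise, which is an
    assumed convention.
    """
    v_r = vx * math.cos(azimuth) + vy * math.sin(azimuth)
    v_t = -vx * math.sin(azimuth) + vy * math.cos(azimuth)
    return v_r, v_t

# Made-up example: object at (10, 10) m in ego BEV, heading along +y at 2 m/s.
azimuth = math.atan2(10.0, 10.0)
rel_yaw = relative_yaw(math.pi / 2.0, azimuth)
v_r, v_t = radial_tangential(0.0, 2.0, azimuth)
```

The projection is a rotation of the velocity vector into a frame aligned with the radial ray, so the speed is unchanged: `math.hypot(v_r, v_t)` equals `math.hypot(vx, vy)` up to rounding.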