RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion

Xiaomeng Chu, Jiajun Deng, Guoliang You, Yifan Duan, Houqiang Li, Yanyong Zhang

TL;DR

RaCFormer tackles depth-induced misalignment in radar-camera fusion for 3D object detection by introducing a query-based cross-perspective fusion framework. It combines adaptive circular query initialization, radar-aware depth prediction, and an implicit dynamic catcher to fuse camera and radar features across the image view and BEV, leveraging Doppler information for temporal awareness. The approach achieves state-of-the-art results on nuScenes and VoD, with 64.9% mAP and 70.2% NDS on the nuScenes test set, and also offers a lightweight real-time variant running at 12 FPS. These findings demonstrate the value of cross-perspective fusion and temporal radar cues for robust, high-performance 3D perception in autonomous systems.

Abstract

We propose the Radar-Camera fusion transformer (RaCFormer) to boost the accuracy of 3D object detection, motivated by the following insight. Radar-camera fusion in outdoor 3D scene perception is capped by the image-to-BEV (bird's-eye view) transformation: if the depth of pixels is not accurately estimated, a naive combination of BEV features integrates unaligned visual content. To avoid this problem, we propose a query-based framework that enables adaptive sampling of instance-relevant features from both the BEV and the original image view. We further improve performance through two key designs: optimizing query initialization and strengthening the representational capacity of BEV. For the former, we introduce an adaptive circular distribution in polar coordinates to refine the initialization of object queries, allowing for a distance-based adjustment of query density. For the latter, we first incorporate a radar-guided depth head to refine the transformation from image view to BEV. We then leverage the Doppler effect of radar and introduce an implicit dynamic catcher to capture the temporal elements within the BEV. Extensive experiments on the nuScenes and View-of-Delft (VoD) datasets validate the merits of our design. Notably, our method achieves 64.9% mAP and 70.2% NDS on nuScenes. RaCFormer also achieves state-of-the-art performance on the VoD dataset. Code is available at https://github.com/cxmomo/RaCFormer.

Paper Structure

This paper contains 13 sections, 4 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Motivation of RaCFormer. (a) Previous methods typically fuse BEV features from image-view transformation and radar point cloud encoding, by concatenation or cross-attention. (b) Instead, RaCFormer uses a query-based fusion framework by simultaneously sampling radar-enhanced image-view features, camera-transformed BEV features, and radar-encoded BEV features.
  • Figure 2: Overall architecture of RaCFormer. The image encoder extracts features from multiple frames of multi-camera images, while multi-frame radar points are voxelized and processed by a pillar encoder. The radar features are flattened into the BEV and enhanced by an implicit dynamic catcher. Simultaneously, radar points are re-projected onto the image plane, with their depth values extended to the full image height, and merged with image features in the depth head to refine depth prediction. The refined depth probability distribution $D'$ and the image features are then input into the lift-splat-shoot (LSS) module to create camera BEV features. The transformer decoder initializes queries with an adjustable circular distribution. Over $L$ layers, a ray sampling module within each layer extracts both image-view and BEV features to refine queries, enabling precise classification and regression by the subsequent heads.
  • Figure 3: (a) Visualization of radar points with raw z-coordinates projected onto the image. (b) Flowchart of the input pre-processing for the radar-aware depth head.
  • Figure 4: The structure of our implicit dynamic catcher. $h_{t}$ represents the hidden state at time $t$, with $h_{0}$ being a preset value of zeros. $x_{t}$ denotes the BEV features output by the pillar encoder at time $t$, while $x'_{t}$ indicates the updated BEV features from $x_{t}$.
  • Figure 5: Comparison of query initialization methods: (a) Radial distribution: Queries are evenly spaced along each ray, with a constant angle $\theta$ separating adjacent rays. (b) Linearly increasing circular distribution: The parameter $n$ denotes the query count in the innermost circle, and the linear growth factor of each outer circle is $\alpha$. The parameter $k$ indicates the query count per ray in (a) and the number of concentric circles in (b).
  • ...and 1 more figures
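
The linearly increasing circular distribution described in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' implementation. The function name `circular_query_init`, the `max_radius` parameter, the growth rule (circle `i` holds `n + alpha * i` queries), and the even radial spacing are all assumptions made for illustration. The caption only states that $n$ is the query count of the innermost circle, $\alpha$ is a linear growth factor, and $k$ is the number of concentric circles.

```python
import math


def circular_query_init(n, alpha, k, max_radius):
    """Place query reference points on k concentric circles (hypothetical helper).

    Assumptions (not taken from the paper):
      * circle i (0-based, innermost first) holds n + alpha * i queries;
      * radii are evenly spaced, with the outermost circle at max_radius;
      * queries on a circle are evenly spaced in angle.
    """
    points = []
    for i in range(k):
        count = n + alpha * i                  # assumed linear growth rule
        radius = max_radius * (i + 1) / k      # assumed even radial spacing
        for j in range(count):
            theta = 2.0 * math.pi * j / count  # polar angle of the j-th query
            points.append((radius * math.cos(theta), radius * math.sin(theta)))
    return points


# Example: 3 circles with 4, 6, and 8 queries respectively.
queries = circular_query_init(n=4, alpha=2, k=3, max_radius=30.0)
print(len(queries))
```

Under these assumptions the outer circles receive more queries, so the spacing between neighbouring queries along a circle grows more slowly with distance than it would if every circle held the same number of queries, as in the radial layout of panel (a).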