CRPlace: Camera-Radar Fusion with BEV Representation for Place Recognition

Shaowei Fu; Yifan Duan; Yao Li; Chengzhen Meng; Yingjie Wang; Jianmin Ji; Yanyong Zhang

TL;DR

CRPlace is a camera-radar fusion network for place recognition that generates background-attentive global descriptors from multi-view images and radar point clouds. It is evaluated thoroughly on the nuScenes dataset.

Abstract

The integration of complementary characteristics from camera and radar data has emerged as an effective approach in 3D object detection. However, such fusion-based methods remain unexplored for place recognition, an equally important task for autonomous systems. Because place recognition relies on the similarity between a query scene and the corresponding candidate scene, the stationary background of a scene is expected to play a crucial role in the task. Current camera-radar fusion methods for 3D object detection are therefore poorly suited to place recognition, since they focus mainly on dynamic foreground objects. In this paper, a camera-radar fusion-based method, named CRPlace, is proposed to generate background-attentive global descriptors from multi-view images and radar point clouds for accurate place recognition. To extract stationary background features effectively, we design an adaptive module that generates a background-attentive mask from the camera BEV feature and radar dynamic points. Guided by this background mask, we devise a bidirectional cross-attention-based spatial fusion strategy that facilitates comprehensive spatial interaction between the background information of the camera BEV feature and the radar BEV feature. As the first camera-radar fusion-based place recognition network, CRPlace has been evaluated thoroughly on the nuScenes dataset. The results show that our algorithm outperforms a variety of baseline methods across a comprehensive set of metrics (recall@1 reaches 91.2%).
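
The recall@1 figure quoted in the abstract is a retrieval metric. As a minimal sketch, assuming descriptors are compared by Euclidean distance and that a retrieval counts as correct when the retrieved place lies within a fixed radius of the query position, it could be computed as follows. The function name `recall_at_n`, the 9 m radius, and the toy data are illustrative assumptions, not values taken from the paper.

```python
import math


def recall_at_n(query_desc, db_desc, query_pos, db_pos, n=1, radius_m=9.0):
    """Fraction of queries whose top-n retrieved database entries contain a true match.

    A database entry is treated as a true match when its position lies within
    radius_m of the query position. radius_m is an assumed threshold.
    """
    hits = 0
    for q_desc, q_pos in zip(query_desc, query_pos):
        # Rank database entries by Euclidean distance in descriptor space.
        ranked = sorted(
            range(len(db_desc)),
            key=lambda i: math.dist(q_desc, db_desc[i]),
        )
        # The query is a hit if any of the n closest entries is geographically near.
        if any(math.dist(q_pos, db_pos[i]) <= radius_m for i in ranked[:n]):
            hits += 1
    return hits / len(query_desc)


# Toy data (hypothetical): 2-D descriptors and 2-D positions in metres.
db_desc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
db_pos = [(0.0, 0.0), (50.0, 0.0), (100.0, 0.0)]
query_desc = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
query_pos = [(2.0, 1.0), (51.0, 0.0), (100.0, 3.0)]

r1 = recall_at_n(query_desc, db_desc, query_pos, db_pos, n=1)
r3 = recall_at_n(query_desc, db_desc, query_pos, db_pos, n=3)
print(f"recall@1 = {r1:.3f}, recall@3 = {r3:.3f}")
```

In the toy data the third query's descriptor resembles the wrong place, so it misses at n=1 and is only recovered once all three candidates are considered.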

Paper Structure (23 sections, 8 equations, 6 figures, 4 tables)

Introduction
Related work
Single-modal Place Recognition
Multi-modal Place Recognition
Rotation Invariance
Method
Modality-Specific BEV Feature Encoding
Multi-View Image Feature Encoding
Radar Feature Encoding
Background-Attentive Mask Generation
Bidirectional Spatial Fusion
Radar-to-Image Fusion
Image-to-Radar Fusion
Convolution-based Fusion
Global Descriptor Generator
...and 8 more sections

Figures (6)

Figure 1: An illustration of (a) the place recognition task with camera and radar fusion and (b) the place recognition results using image-only and fusion-based methods, respectively. Given an image query (marked with a blue bounding box) that includes multiple dynamic objects, the image-based place recognition method [xu2023leveraging] retrieves an incorrect candidate due to the influence of dynamic objects (marked with a red bounding box), while our method retrieves the correct candidate (marked with a green bounding box) using the image-radar query acquired at the same place.
Figure 2: The network architecture of the proposed CRPlace. Given multi-view images and radar point clouds, two modality-specific streams first extract features separately and transform them into the same BEV space. Next, the Background-Attentive Mask Generation (BAMG) module uses radar dynamic points and camera BEV features to create a background attention mask adaptively. Then the Bidirectional Spatial Fusion (BSF) module attentively fuses the background BEV features from the two modalities. Finally, the Global Descriptor Generator uses the fused BEV features to generate rotation-invariant global descriptors.
Figure 3: An illustration of the BAMG module. All dynamic points are selected from the radar point clouds and voxelized into a grid. The radar voxel grid and the camera BEV feature are then used to generate the background attention mask adaptively according to their positional relationships. $(x_{ne},y_{ne})$ denotes the position of a non-empty voxel.
Figure 4: An illustration of the Bidirectional Spatial Fusion block. The block takes the camera BEV feature, the radar BEV feature, and the background attention mask as input. A Self-Attention module is first applied to each of the two features. Then a Radar-to-Image Fusion and an Image-to-Radar Fusion perform bidirectional spatial interaction. Finally, a convolution-based fusion operation is performed. A linear layer generates the radar feature input for the next block.
Figure 5: Precision-recall curve of SOTA methods on the nuScenes dataset.
...and 1 more figures
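
The first step of the BAMG module described in the Figure 3 caption, selecting dynamic radar points and voxelizing them into a grid, can be sketched as below. This is a simplified illustration only: it produces a fixed binary mask, whereas the paper's mask is generated adaptively from the camera BEV feature as well, and that part is not reproduced here. The function names, the point layout `(x, y, vx, vy)`, the speed threshold, and the grid parameters are all assumptions made for the example.

```python
def dynamic_voxel_grid(points, grid_size=8, extent_m=40.0, speed_thresh=0.5):
    """Mark BEV cells that contain at least one dynamic radar point.

    points: iterable of (x, y, vx, vy) in the ego frame (metres, m/s).
    A point is treated as dynamic when its speed is at least speed_thresh
    (an assumed rule). Returns a grid_size x grid_size list of 0/1 values.
    """
    cell = 2.0 * extent_m / grid_size
    grid = [[0] * grid_size for _ in range(grid_size)]
    for x, y, vx, vy in points:
        if (vx * vx + vy * vy) ** 0.5 < speed_thresh:
            continue  # static point: leave the cell as background
        if not (-extent_m <= x < extent_m and -extent_m <= y < extent_m):
            continue  # outside the BEV range
        # Clamp to guard against floating-point rounding at the upper edge.
        row = min(int((y + extent_m) / cell), grid_size - 1)
        col = min(int((x + extent_m) / cell), grid_size - 1)
        grid[row][col] = 1
    return grid


def background_mask(dynamic_grid):
    """Binary background mask: 1 where no dynamic point was observed."""
    return [[1 - v for v in row] for row in dynamic_grid]


# Hypothetical radar returns: two moving, one static, one out of range.
points = [
    (5.0, 5.0, 3.0, 0.0),       # moving, inside the grid
    (-15.0, 20.0, 0.0, 0.1),    # nearly static
    (-35.0, -35.0, 0.0, -2.0),  # moving, inside the grid
    (100.0, 0.0, 5.0, 0.0),     # moving, outside the grid
]
mask = background_mask(dynamic_voxel_grid(points))
for row in mask:
    print(row)
```

A mask of this kind could then down-weight the BEV cells occupied by moving objects before fusion; how CRPlace actually applies its mask inside the attention layers is described in the paper's method section rather than here.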
