BEV$^2$PR: BEV-Enhanced Visual Place Recognition with Structural Cues

Fudong Ge, Yiwei Zhang, Shuhan Shen, Yue Wang, Weiming Hu, Jin Gao

TL;DR

BEV$^2$PR introduces a monocular BEV-based structural cue pathway to augment RGB-based visual place recognition. The method employs a two-stage training pipeline with a shared bottom backbone to fuse BEV-derived semantics and visual features into a composite descriptor, improving on multiple RGB aggregators. Experiments on the VPR-NuScenes dataset show substantial gains, particularly in hard illumination and blur scenarios, and the method generalizes to KITTI. The work demonstrates that explicit BEV structure can enhance place recognition without additional sensors, suggesting practical benefits for robust localization in mobile robots and autonomous vehicles.

Abstract

In this paper, we propose a new image-based visual place recognition (VPR) framework by exploiting the structural cues in bird's-eye view (BEV) from a single monocular camera. The motivation arises from two key observations about place recognition methods based on both appearance and structure: 1) For the methods relying on LiDAR sensors, the integration of LiDAR in robotic systems has led to increased expenses, while the alignment of data between different sensors is also a major challenge. 2) Other image-/camera-based methods, which integrate RGB images and their derived variants (e.g., pseudo depth images, pseudo 3D point clouds), exhibit several limitations, such as the failure to effectively exploit the explicit spatial relationships between different objects. To tackle the above issues, we design a new BEV-enhanced VPR framework, namely BEV$^2$PR, generating a composite descriptor with both visual cues and spatial awareness based on a single camera. The key points lie in: 1) We use BEV features as an explicit source of structural knowledge in constructing global features. 2) The lower layers of the pre-trained backbone from BEV generation are shared for visual and structural streams in VPR, facilitating the learning of fine-grained local features in the visual stream. 3) The complementary visual and structural features can jointly enhance VPR performance. Our BEV$^2$PR framework enables consistent performance improvements over several popular aggregation modules for RGB global features. The experiments on our collected VPR-NuScenes dataset demonstrate an absolute gain of 2.47% on Recall@1 for the strong Conv-AP baseline to achieve the best performance in our setting, and notably, an 18.06% gain on the hard set. The code and dataset will be available at https://github.com/FudongGe/BEV2PR.
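
For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@k is commonly computed in place recognition: a query counts as a hit if at least one of its top-k retrieved database images is a true match. The function name `recall_at_k` and the input layout are assumptions made for illustration and are not taken from the paper's code. An "absolute gain" is then simply the difference between two such percentages.

```python
def recall_at_k(retrieved, ground_truth, k=1):
    """Percentage of queries with at least one correct match in the top k.

    retrieved:    per-query ranked list of database indices (best first).
    ground_truth: per-query set of database indices that count as the same place.
    Both layouts are assumptions for this sketch.
    """
    hits = sum(
        1
        for ranked, positives in zip(retrieved, ground_truth)
        if any(idx in positives for idx in ranked[:k])
    )
    return 100.0 * hits / len(retrieved)


# Toy example with four queries (made-up indices, not paper data).
retrieved = [[3, 1], [0, 2], [5, 4], [2, 0]]
ground_truth = [{3}, {2}, {9}, {2}]
r1 = recall_at_k(retrieved, ground_truth, k=1)
r2 = recall_at_k(retrieved, ground_truth, k=2)
print(r1, r2)  # 50.0 75.0
```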

Paper Structure (18 sections, 16 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Schematic diagram of methods based on both appearance and structure using camera and/or LiDAR sensors: a) methods using RGB, segmentation and depth images, b) methods using RGB images from camera and 3D point clouds from LiDAR, c) methods using RGB and segmentation images, d) methods using RGB images and 3D point clouds reconstructed from multiple RGB images, e) methods using RGB and depth images, f) (ours) using RGB images and BEV generated from RGB images.
  • Figure 2: Overview of our proposed pipeline. (1) In Stage I, we pre-train a BEV generation model using a front-view image to extract BEV features as a more explicit source of structural knowledge in the global feature. (2) In Stage II, we first copy the modules of the BEV model as part of the structural stream and freeze them to introduce BEV features into VPR, then split its backbone into two parts, with the former serving as the bottom backbone shared with the visual stream and the latter copied as the sub-backbone of the visual stream and then unfrozen. After that, any aggregation module for RGB global features is inserted into the visual stream and a relatively simple GeM module is used in the structural stream, followed by a feature fusion operation. (3) Finally, a nearest neighbor search is used to retrieve the top-$k$ images.
  • Figure 3: Example of samples with different recall difficulties on our collected VPR-NuScenes dataset.
  • Figure 4: Visual comparison of RGB and BEV features. BEV features exhibit more pronounced structural characteristics than RGB features.
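
The descriptor side of the pipeline in Figure 2 (GeM pooling in the structural stream, feature fusion, then nearest neighbor search) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the caption does not specify the fusion operation, so concatenation of L2-normalized descriptors is assumed here, and the GeM exponent `p = 3`, the Euclidean distance, and all function names are likewise hypothetical choices.

```python
import math


def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling: one value per channel.

    feature_map: list of channels, each a flat list of spatial activations.
    p = 1 gives average pooling; larger p approaches max pooling.
    """
    pooled = []
    for channel in feature_map:
        # Clamp to eps so the fractional power is defined for all inputs.
        mean_pow = sum(max(v, eps) ** p for v in channel) / len(channel)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


def l2_normalize(vec, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def fuse(visual_desc, structural_desc):
    """ASSUMPTION: fusion is concatenation of the two L2-normalized
    descriptors, followed by renormalization of the result."""
    return l2_normalize(l2_normalize(visual_desc) + l2_normalize(structural_desc))


def top_k(query_desc, database_descs, k=1):
    """Indices of the k database descriptors closest to the query
    (brute-force Euclidean nearest neighbor search)."""
    dists = [math.dist(query_desc, d) for d in database_descs]
    return sorted(range(len(dists)), key=dists.__getitem__)[:k]


# Toy usage: tiny made-up feature maps stand in for the two streams.
structural = gem_pool([[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]])
composite = fuse([0.9, 0.1], structural)
database = [composite, fuse([0.1, 0.9], [1.0, 0.0])]
print(top_k(composite, database, k=1))
```

In practice the brute-force search would typically be replaced by an indexed search over the database descriptors, but the ranking logic is the same.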