BEVLoc: Cross-View Localization and Matching via Birds-Eye-View Synthesis

Christopher Klammer, Michael Kaess

TL;DR

This work proposes a novel framework for synthesizing a birds-eye-view (BEV) scene representation to match and localize against an aerial map in off-road environments, and analyzes the model’s performance for coarse and fine matching.

Abstract

Ground-to-aerial matching is a crucial and challenging task in outdoor robotics, particularly when GPS is absent or unreliable. Structures like buildings or large dense forests create interference, requiring GNSS replacements for global positioning estimates. The true difficulty lies in reconciling the perspective difference between the ground and aerial images for acceptable localization. Taking inspiration from the autonomous driving community, we propose a novel framework for synthesizing a birds-eye-view (BEV) scene representation to match and localize against an aerial map in off-road environments. We leverage contrastive learning with domain-specific hard negative mining to train a network to learn similar representations between the synthesized BEV and the aerial map. During inference, BEVLoc guides the identification of the most probable locations within the aerial map through a coarse-to-fine matching strategy. Our results demonstrate promising initial outcomes in extremely difficult forest environments with limited semantic diversity. We analyze our model's performance for coarse and fine matching, assessing both the raw matching capability of our model and its performance as a GNSS replacement. Our work delves into off-road map localization while establishing a foundational baseline for future developments in localization. Our code is available at: https://github.com/rpl-cmu/bevloc
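
The contrastive objective mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the exact loss, so the InfoNCE-style formulation, the temperature value, and the function names `cosine`, `info_nce`, and `mine_hard_negatives` below are assumptions made for illustration, not the authors' implementation. Candidates passed to the mining step are assumed to be aerial embeddings from crops that do not match the BEV embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_hard_negatives(bev, candidates, k):
    """Keep the k non-matching aerial embeddings most similar to the BEV embedding.

    Hypothetical helper: 'hard' here simply means 'high similarity'.
    """
    return sorted(candidates, key=lambda c: cosine(bev, c), reverse=True)[:k]

def info_nce(bev, aerial_pos, aerial_negs, temperature=0.1):
    """InfoNCE-style loss for one BEV embedding (assumed stand-in loss).

    Returns -log of the softmax probability assigned to the positive pair.
    """
    logits = [cosine(bev, aerial_pos) / temperature]
    logits += [cosine(bev, n) / temperature for n in aerial_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Toy 2-D embeddings: one positive aerial crop and three non-matching crops.
bev = [1.0, 0.0]
positive = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

hard = mine_hard_negatives(bev, candidates, k=1)
loss_hard = info_nce(bev, positive, hard)
loss_easy = info_nce(bev, positive, [[-1.0, 0.0]])
```

In this toy setup the hard negative yields a larger loss than the easy one, which is the reason hard negatives are more informative during training.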

Paper Structure (23 sections, 17 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Visualization of the coarse-to-fine matching strategy used by BEVLoc. The top panes show the coarse matches with the map-aligned image and the corresponding correlation volume. The bottom panes show the map rotated by the predicted yaw angle and the refinement of the coarse location, which creates a probabilistic prediction for the localization by weighting the correlation maps over the top $k$ coarse matches. The red circle denotes the ground truth localization; the black circle denotes the predicted localization.
  • Figure 2: Illustration of our method's performance, leveraging registration estimates to mitigate drift from visual odometry. Left: Comparison of BEVLoc and Tartan VO against the ground truth GPS trajectory. Right: RPE performance of BEVLoc throughout a long trajectory against GPS ground truth.
  • Figure 3: BEVLoc contrastive learning training pipeline. Feature maps are encoded from ground and aerial camera images. The ground features are lifted to 3D to create a semantic and temporally consistent BEV feature map to be compared against the aerial feature map. The embeddings are created and used to find hard negatives near the prior location and prior rotation to learn how to match ground to aerial images.
  • Figure 4: An illustration of the global pose graph, which corresponds to the 3DoF pose of the vehicle. As a vision-only pipeline, we utilize relative visual odometry measurements and high-quality registration estimates.
  • Figure 5: Training and testing splits on TartanDrive 2.0 traced in Google Maps. Training trajectories are in red and testing trajectories are in blue with less than five percent trajectory overlap.
  • ...and 1 more figure
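
The refinement described in the Figure 1 caption, weighting correlation scores over the top $k$ coarse matches, can be sketched roughly as follows. This is a simplified illustration under stated assumptions: a single 2D correlation map given as a list of rows, softmax weighting with a hypothetical `temperature` parameter, and a hypothetical function name `topk_weighted_location`. The paper's actual refinement, including the yaw rotation step, may differ.

```python
import math

def topk_weighted_location(corr, k=3, temperature=1.0):
    """Softmax-weighted mean (row, col) of the k highest-scoring cells.

    corr: 2D list of correlation scores (assumed layout: list of rows).
    """
    cells = [(score, r, c)
             for r, row in enumerate(corr)
             for c, score in enumerate(row)]
    top = sorted(cells, reverse=True)[:k]  # coarse step: keep the k best cells
    best = top[0][0]
    # Fine step: turn scores into weights; subtracting the best score
    # keeps the exponentials numerically stable.
    weights = [math.exp((s - best) / temperature) for s, _, _ in top]
    total = sum(weights)
    row = sum(w * r for w, (_, r, _) in zip(weights, top)) / total
    col = sum(w * c for w, (_, _, c) in zip(weights, top)) / total
    return row, col

# Two equally strong neighbouring peaks: the estimate falls between them.
corr = [
    [0.0, 0.0, 0.0],
    [0.0, 5.0, 5.0],
    [0.0, 0.0, 0.0],
]
row, col = topk_weighted_location(corr, k=2)
```

With `k=1` the estimate collapses to the single best cell, so `k` controls how much the prediction is smoothed across competing coarse matches.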