BEV-CV: Birds-Eye-View Transform for Cross-View Geo-Localisation

Tavis Shore, Simon Hadfield, Oscar Mendez

TL;DR

This work addresses cross-view geo-localisation with limited field-of-view cameras by introducing BEV-CV, which transforms ground-level POV images into a semantic Birds-Eye-View (BEV) before cross-view embedding. It uses a two-branch architecture that projects POV and aerial features into a shared BEV latent space and trains with an NT-Xent loss, improving Top-1 recall on 70° crops of CVUSA and CVACT by 23% and 24% respectively while reducing floating point operations and cutting embedding dimensionality by 33%. These savings make the approach more practical for GNSS-denied localisation in mobile robotics, with faster query times and lower memory requirements, although the BEV transformation depends on known camera intrinsics. Future work aims to remove this dependence on intrinsics and to improve robustness across regions, lighting, and weather conditions.
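
The NT-Xent objective mentioned above can be illustrated with a minimal sketch. It assumes a one-directional (POV-to-aerial) form in which every other aerial embedding in the batch is a negative; the function names and the temperature value are illustrative assumptions, and the paper's exact formulation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(pov, aerial, temperature=0.1):
    """Mean POV-to-aerial NT-Xent loss over a batch of matched pairs.

    Assumption: pov[i] and aerial[i] embed the same location, and every
    other aerial[j] in the batch serves as a negative for pov[i].
    """
    losses = []
    for i, query in enumerate(pov):
        logits = [cosine(query, ref) / temperature for ref in aerial]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # -log(softmax probability of the true pair)
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each POV embedding is most similar to its own aerial embedding and grows when a mismatched aerial embedding scores higher, which is the behaviour that pulls matching pairs together in the shared latent space.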

Abstract

Cross-view image matching for geo-localisation is a challenging problem due to the significant visual difference between aerial and ground-level viewpoints. The method provides localisation from geo-referenced images, eliminating the need for external devices or costly equipment. This enhances the capacity of agents to autonomously determine their position, navigate, and operate effectively in GNSS-denied environments. Current research employs a variety of techniques to reduce the domain gap, such as applying polar transforms to aerial images or synthesising between perspectives. However, these approaches generally rely on having a 360° field of view, limiting real-world feasibility. We propose BEV-CV, an approach introducing two key novelties with a focus on improving the real-world viability of cross-view geo-localisation. First, we bring ground-level images into a semantic Birds-Eye-View before matching embeddings, allowing direct comparison with aerial image representations. Second, we adapt datasets into an application-realistic format: limited Field-of-View images aligned to vehicle direction. BEV-CV achieves state-of-the-art recall accuracies, improving Top-1 rates on 70° crops of CVUSA and CVACT by 23% and 24% respectively. It also decreases computational requirements, reducing floating point operations to below those of previous works and decreasing embedding dimensionality by 33%, which together allow for faster localisation.
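
The dataset adaptation described above, a limited field-of-view crop aligned to the vehicle heading, can be sketched as a column selection on an equirectangular panorama. This is an illustration under stated assumptions: column 0 is taken to correspond to a heading of 0° with heading increasing along the columns, which may not match the datasets' actual conventions, and `fov_crop_columns` and `crop_row` are hypothetical helper names.

```python
def fov_crop_columns(pano_width, heading_deg, fov_deg=70.0):
    """Column indices of a crop centred on heading_deg, wrapping at the seam.

    Assumption: the panorama spans 360 degrees across pano_width columns,
    with column 0 at heading 0 degrees.
    """
    crop_width = round(pano_width * fov_deg / 360.0)
    centre = round(pano_width * (heading_deg % 360.0) / 360.0)
    start = centre - crop_width // 2
    # The modulo wraps indices that fall off either edge of the panorama.
    return [(start + k) % pano_width for k in range(crop_width)]


def crop_row(row, heading_deg, fov_deg=70.0):
    """Apply the heading-aligned crop to a single row of pixels."""
    return [row[c] for c in fov_crop_columns(len(row), heading_deg, fov_deg)]
```

For a 360-column panorama and a heading of 0°, a 70° crop selects columns 325 to 359 followed by 0 to 34, so the crop stays contiguous in viewing direction even though it crosses the image seam.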

Paper Structure (20 sections, 6 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: General BEV-CV network structure. The POV Branch extracts and transforms ground-level feature embeddings; the Map Branch extracts aerial embeddings to build a KDTree. Components to the right of the dotted red lines are discarded in the final BEV-CV architecture.
  • Figure 2: BEV-CV network overview. The BEV Branch (upper pathway) transforms the POV image to BEV before extracting the embedding for projection; the Aerial Branch (lower pathway) extracts embeddings from the U-Net latent space. At training time we use an NT-Xent loss function; at inference time we build a KDTree of aerial embeddings and query it with POV embeddings, using descriptor cosine similarity for retrieval.
  • Figure 3: Panoramic examples of CVUSA and CVACT, with heading-aligned $90^\circ$ FOV crops shown on the right-hand side.
  • Figure 4: BEV-CV CVUSA Top-5 recall examples. Outlines: purple = query POV image, green = correct aerial image, red = incorrect aerial image.
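
The inference step in the Figure 2 caption and the Top-K recall metric illustrated in Figure 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: a brute-force search stands in for the KDTree, and it assumes that query `i` matches reference `i`. On L2-normalised vectors, squared Euclidean distance equals 2 − 2·cos, so a Euclidean nearest-neighbour structure returns the same ranking as cosine similarity.

```python
import math


def normalise(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k(query, references, k):
    """Indices of the k references most cosine-similar to the query."""
    q = normalise(query)
    scores = [sum(a * b for a, b in zip(q, normalise(r))) for r in references]
    # Brute-force ranking; a KD-tree over unit vectors would give the same order.
    return sorted(range(len(references)), key=lambda i: -scores[i])[:k]


def recall_at_k(queries, references, k):
    """Fraction of queries whose true match (same index) is in the top k."""
    hits = sum(1 for i, q in enumerate(queries) if i in top_k(q, references, k))
    return hits / len(queries)
```

With POV embeddings as `queries` and aerial embeddings as `references`, `recall_at_k(queries, references, 1)` corresponds to the Top-1 rate and `k=5` to the Top-5 setting shown in Figure 4.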