BEV-CV: Birds-Eye-View Transform for Cross-View Geo-Localisation
Tavis Shore, Simon Hadfield, Oscar Mendez
TL;DR
This work addresses the challenge of cross-view geo-localisation with limited field-of-view cameras by introducing BEV-CV, which transforms ground-level POV images into a semantic Birds-Eye-View before cross-view embedding. It employs a two-branch architecture that projects POV and aerial features into a shared BEV latent space and trains with an NT-Xent loss, achieving state-of-the-art recall on CVUSA and CVACT while substantially reducing computation and embedding dimensionality. The approach improves practical viability for GNSS-denied localisation in mobile robotics, offering faster query times and lower memory requirements, although the BEV transformation depends on known camera intrinsics. Future work aims to remove this dependence on intrinsics and to improve robustness across regions, lighting, and weather conditions, extending BEV-CV's applicability to realistic deployment scenarios.
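To make the NT-Xent (normalised temperature-scaled cross-entropy) loss mentioned above concrete, the following is a minimal pure-Python sketch of its standard SimCLR-style form applied to paired POV and aerial embeddings. The function name `nt_xent_loss`, the temperature value, and the treatment of each POV/aerial pair as the positive are illustrative assumptions; the exact formulation used to train BEV-CV is not given in this summary.

```python
import math


def _normalise(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nt_xent_loss(pov_embeddings, aerial_embeddings, temperature=0.1):
    """Standard NT-Xent over 2N embeddings (hypothetical helper, not the authors' code).

    pov_embeddings[i] and aerial_embeddings[i] are assumed to describe the same
    location and form the positive pair; every other embedding in the batch
    acts as a negative.
    """
    n = len(pov_embeddings)
    z = [_normalise(v) for v in list(pov_embeddings) + list(aerial_embeddings)]
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of sample i
        positive = _dot(z[i], z[j]) / temperature
        # Similarities to every sample except i itself (positive included).
        logits = [_dot(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - positive
    return total / (2 * n)


# Toy batch: two locations with 2-D embeddings.
pov = [[1.0, 0.0], [0.0, 1.0]]
aerial_matched = [[1.0, 0.0], [0.0, 1.0]]
aerial_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = nt_xent_loss(pov, aerial_matched)
loss_swapped = nt_xent_loss(pov, aerial_swapped)
```

In this toy batch the correctly paired embeddings yield a much lower loss than the swapped pairing, which is the behaviour that pulls matching POV and aerial embeddings together in the shared latent space.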
Abstract
Cross-view image matching for geo-localisation is a challenging problem due to the significant visual difference between aerial and ground-level viewpoints. The method provides localisation capabilities from geo-referenced images, eliminating the need for external devices or costly equipment. This enhances the capacity of agents to autonomously determine their position, navigate, and operate effectively in GNSS-denied environments. Current research employs a variety of techniques to reduce the domain gap, such as applying polar transforms to aerial images or synthesising between perspectives. However, these approaches generally rely on having a 360° field of view, limiting real-world feasibility. We propose BEV-CV, an approach introducing two key novelties with a focus on improving the real-world viability of cross-view geo-localisation. Firstly, we bring ground-level images into a semantic Birds-Eye-View before matching embeddings, allowing for direct comparison with aerial image representations. Secondly, we adapt datasets into an application-realistic format: limited Field-of-View images aligned to vehicle direction. BEV-CV achieves state-of-the-art recall accuracies, improving Top-1 rates on 70° crops of CVUSA and CVACT by 23% and 24% respectively. It also decreases computational requirements by reducing floating point operations to below those of previous works and decreasing embedding dimensionality by 33%, together allowing for faster localisation.
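As an illustration of what a limited Field-of-View crop aligned to vehicle direction could look like, the sketch below selects the columns of a 360° panorama that fall within a 70° window centred on a heading. It assumes an equirectangular panorama in which column 0 corresponds to a heading of 0° and heading increases with column index; the function names and this convention are hypothetical, and the abstract does not specify how the crops are actually produced.

```python
def fov_crop_columns(pano_width, heading_deg, fov_deg=70.0):
    """Return panorama column indices covering fov_deg centred on heading_deg.

    Assumes column 0 is heading 0 degrees and columns wrap around at 360 degrees.
    """
    crop_width = round(pano_width * fov_deg / 360.0)
    centre = round(pano_width * (heading_deg % 360.0) / 360.0)
    start = centre - crop_width // 2
    # Modulo handles crops that straddle the left/right seam of the panorama.
    return [(start + k) % pano_width for k in range(crop_width)]


def crop_panorama(pano_rows, heading_deg, fov_deg=70.0):
    """Crop a panorama stored as a list of pixel rows to the requested FOV."""
    width = len(pano_rows[0])
    cols = fov_crop_columns(width, heading_deg, fov_deg)
    return [[row[c] for c in cols] for row in pano_rows]


# Toy panorama: 2 rows, 360 columns, each pixel holds its own column index.
pano = [list(range(360)) for _ in range(2)]
forward_crop = crop_panorama(pano, heading_deg=0.0)    # straddles the seam
rear_crop = crop_panorama(pano, heading_deg=180.0)     # contiguous columns
```

With one column per degree, the forward-facing crop wraps around the seam of the panorama, while the rear-facing crop is a contiguous block of columns.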
