Window-to-Window BEV Representation Learning for Limited FoV Cross-View Geo-localization

Lei Cheng, Teng Wang, Lingquan Meng, Changyin Sun

TL;DR

Cross-view geo-localization suffers from large viewpoint changes when ground images have unknown orientation and limited FoV. The paper introduces Window-to-Window BEV learning (W2W-BEV), which constructs BEV representations directly from ground-view features by initializing BEV embeddings from depth-informed ground features, then establishing local correspondences via a context-aware window matching strategy and refining the BEV through cross-attention. This approach yields substantial gains over prior methods, achieving notable improvements in R@1 on CVUSA and CVACT, particularly at a 90° FoV (e.g., from $47.24\%$ to $64.73\%$ on CVUSA), and remains effective under unknown orientation and various FoVs. Key contributions include depth-assisted BEV initialization, window-level correspondence instead of point-to-point alignment, and an efficient BEV encoder that aggregates multi-scale features. The method enables robust cross-view localization in practical settings, though it incurs higher memory and computation, and could benefit from advances in depth estimation.
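For readers unfamiliar with the R@1 metric quoted above, the following is a minimal sketch of how recall at K is commonly computed in cross-view retrieval. It assumes a similarity matrix in which query `i` has its true match at reference `i`. The function name `recall_at_k` and the toy matrix are illustrative and not taken from the paper.

```python
import numpy as np

def recall_at_k(sim, k=1):
    """Fraction of queries whose true match ranks within the top k.

    sim[i, j] is the similarity between ground query i and aerial reference j.
    Assumption: the true match of query i is reference i (paired datasets).
    """
    true_score = np.diag(sim)[:, None]
    # Number of references scored strictly higher than the true match.
    rank = (sim > true_score).sum(axis=1)
    return float((rank < k).mean())

# Toy example: 3 queries, 3 references (hypothetical scores).
sim = np.array([
    [0.9, 0.1, 0.2],   # query 0: true match ranked first
    [0.8, 0.3, 0.1],   # query 1: true match ranked second
    [0.2, 0.1, 0.7],   # query 2: true match ranked first
])
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

In this toy case two of the three queries retrieve their true match first, so `r1` is 2/3, and every true match falls within the top two, so `r2` is 1.0.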

Abstract

Cross-view geo-localization confronts significant challenges due to large perspective changes, especially when the ground-view query image has a limited field of view (FoV) with unknown orientation. To bridge the cross-view domain gap, we explore, for the first time, learning a BEV representation directly from the ground query image. However, the unknown orientation between ground and aerial images, combined with the absence of camera parameters, leads to ambiguity between BEV queries and ground references. To tackle this challenge, we propose a novel Window-to-Window BEV representation learning method, termed W2W-BEV, which adaptively matches BEV queries to ground references at the window scale. Specifically, predefined BEV embeddings and extracted ground features are segmented into a fixed number of windows, and the most similar ground window is then chosen for each BEV feature based on a context-aware window matching strategy. Subsequently, cross-attention is performed between the matched BEV and ground windows to learn a robust BEV representation. Additionally, we use ground features along with predicted depth information to initialize the BEV embeddings, which helps learn more powerful BEV representations. Extensive experimental results on benchmark datasets demonstrate the significant superiority of W2W-BEV over previous state-of-the-art methods under the challenging conditions of unknown orientation and limited FoV. Specifically, on the CVUSA dataset with a limited FoV of 90° and unknown orientation, W2W-BEV achieves a significant improvement in R@1 accuracy from 47.24% to 64.73% (+17.49%).
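The window-level pipeline described in the abstract (segment into windows, match each BEV window to a ground window, then cross-attend) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. Each window is summarized by its mean token and compared by cosine similarity, which stands in for the paper's context-aware matching strategy, and the attention is single-head with no learned projections. All function names and tensor sizes are hypothetical.

```python
import numpy as np

def split_windows(tokens, num_windows):
    """Reshape (N, C) tokens into (num_windows, N // num_windows, C)."""
    n, c = tokens.shape
    assert n % num_windows == 0, "token count must divide evenly into windows"
    return tokens.reshape(num_windows, n // num_windows, c)

def match_windows(bev_win, gnd_win):
    """Return, for each BEV window, the index of the most similar ground window.

    Assumption: mean-pooled window descriptors and cosine similarity replace
    the paper's context-aware matching, whose details are not given here.
    """
    b = bev_win.mean(axis=1)
    g = gnd_win.mean(axis=1)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-8)
    return (b @ g.T).argmax(axis=1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_cross_attention(bev_win, gnd_win, match):
    """Cross-attention: BEV tokens query the tokens of their matched ground window."""
    matched = gnd_win[match]                                  # (Wb, Tg, C)
    scale = np.sqrt(bev_win.shape[-1])
    attn = softmax(bev_win @ matched.transpose(0, 2, 1) / scale, axis=-1)
    return bev_win + attn @ matched                           # residual update

rng = np.random.default_rng(0)
bev_tokens = rng.standard_normal((64, 32))   # hypothetical BEV embeddings
gnd_tokens = rng.standard_normal((96, 32))   # hypothetical ground features
bev_w = split_windows(bev_tokens, 8)         # (8, 8, 32)
gnd_w = split_windows(gnd_tokens, 8)         # (8, 12, 32)
match = match_windows(bev_w, gnd_w)          # (8,) ground window index per BEV window
refined = window_cross_attention(bev_w, gnd_w, match)
```

Restricting attention to one matched window keeps each attention map at window size instead of spanning all BEV and ground tokens, which is one plausible reading of why window-level correspondence is preferred when no camera parameters are available for point-to-point alignment.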

Paper Structure

This paper contains 25 sections, 7 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of ground images with unknown orientation and different limited FoVs. The second and third rows show the regions of interest corresponding to the BEV representations in the ground-level and aerial images, respectively. The BEV representations successfully extract key spatial structure information from the ground-level images. Furthermore, as the FoV of the ground image increases, the BEV representations learn more extensive and effective information from the ground, corresponding to a broader region of interest in the aerial images.
  • Figure 2: Overview of the proposed method. To facilitate the learning of BEV representations, we use the ground feature C4 from the multi-scale features to predict its depth probability. This extends the 2D features into 3D, which are then compressed along the $H$ dimension to generate the initial BEV embeddings.
  • Figure 3: Top-1 recall (R@1) of our model under different training and testing FoVs.
  • Figure 4: Effect of the number of BEV encoder blocks on recall accuracy under different FoVs.
  • Figure 5: Effect of the BEV embedding size on recall accuracy under different FoVs.
  • ...and 2 more figures
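
The depth-assisted initialization described in the Figure 2 caption can be sketched as a small "lift and collapse" operation. The sketch below assumes a feature map of shape `(C, H, W)`, per-pixel depth logits of shape `(D, H, W)` from a hypothetical depth head, a softmax over the depth bins, and summation as the way to compress the `H` dimension. The paper may use a different reduction. The name `init_bev_from_depth` and all shapes are illustrative.

```python
import numpy as np

def init_bev_from_depth(feat, depth_logits):
    """Lift 2D features with a depth distribution, then collapse the height axis.

    feat:         (C, H, W) ground feature map (e.g. C4 in the caption).
    depth_logits: (D, H, W) unnormalized depth scores per pixel (assumed head output).
    returns:      (C, D, W) initial BEV embeddings (depth bins x image columns).
    """
    # Softmax over the D depth bins gives a per-pixel depth probability.
    z = depth_logits - depth_logits.max(axis=0, keepdims=True)
    prob = np.exp(z)
    prob = prob / prob.sum(axis=0, keepdims=True)
    # Outer product extends the 2D features to 3D: (C, D, H, W).
    volume = feat[:, None, :, :] * prob[None, :, :, :]
    # Assumption: the H dimension is compressed by summation.
    return volume.sum(axis=2)

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 4, 10))          # C=16, H=4, W=10 (hypothetical sizes)
depth_logits = rng.standard_normal((8, 4, 10))   # D=8 depth bins
bev_init = init_bev_from_depth(feat, depth_logits)
```

Because the depth probabilities sum to one at every pixel, summing `bev_init` over its depth axis recovers the column-wise sum of `feat`, which is a convenient sanity check for this kind of lifting.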