GeoDTR+: Toward generic cross-view geolocalization via geometric disentanglement

Xiaohan Zhang, Xingyu Li, Waqas Sultani, Chen Chen, Safwan Wshah

TL;DR

GeoDTR+ tackles cross-area cross-view geo-localization (CVGL) by explicitly disentangling geometric layout from appearance using a Geometric Layout Extractor (GLE) and by augmenting training with Layout Simulation and Semantic Augmentation. It introduces Contrastive Hard Samples Generation (CHSG) to enforce intra-batch hard-negative learning and a counterfactual learning scheme to avoid degenerate geometric descriptors. The combination yields state-of-the-art cross-area performance on CVUSA, CVACT, and VIGOR while preserving competitive same-area results and maintaining a small parameter footprint. Overall, the work advances practical cross-view geolocalization by stabilizing geometry-driven representations and enabling efficient hard-sample learning.

Abstract

Cross-View Geo-Localization (CVGL) estimates the location of a ground image by matching it to a geo-tagged aerial image in a database. Recent works have achieved outstanding progress on CVGL benchmarks. However, existing methods still perform poorly in cross-area evaluation, in which the training and testing data are captured from completely distinct areas. We attribute this deficiency to an inability to extract the geometric layout of visual features and to models' overfitting to low-level details. Our preliminary work introduced a Geometric Layout Extractor (GLE) to capture the geometric layout from input features. However, the previous GLE does not fully exploit the information in the input features. In this work, we propose GeoDTR+ with an enhanced GLE module that better models the correlations among visual features. To fully explore the Layout simulation and Semantic augmentation (LS) techniques from our preliminary work, we further propose Contrastive Hard Samples Generation (CHSG) to facilitate model training. Extensive experiments show that GeoDTR+ achieves state-of-the-art (SOTA) results in cross-area evaluation on CVUSA, CVACT, and VIGOR by a large margin ($16.44\%$, $22.71\%$, and $13.66\%$ without polar transformation) while keeping the same-area performance comparable to existing SOTA. Moreover, we provide detailed analyses of GeoDTR+. Our code will be available at https://gitlab.com/vail-uvm/geodtr plus.
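
To make the retrieval formulation concrete, the following is a minimal sketch of how a recall-at-k score such as R@1 might be computed. It is not taken from the authors' code. It assumes that ground and aerial images have already been embedded as L2-normalized vectors and that row i of one matrix matches row i of the other (a one-to-one layout, which is a simplifying assumption). The function name `recall_at_k` and the toy data are hypothetical.

```python
import numpy as np

def recall_at_k(ground, aerial, k=1):
    """Fraction of ground queries whose true aerial match ranks in the top k.

    Assumption: row i of `ground` matches row i of `aerial`, and all rows
    are L2-normalized so the dot product equals cosine similarity.
    """
    sim = ground @ aerial.T               # similarity matrix, shape (N, N)
    true_sim = np.diag(sim)[:, None]      # similarity to the correct aerial image
    # Rank of the true match = number of aerial images scoring strictly higher.
    rank = (sim > true_sim).sum(axis=1)
    return float((rank < k).mean())

# Toy example: three queries, with the first two reference images swapped.
ground = np.eye(3)
aerial = np.eye(3)[[1, 0, 2]]
r_at_1 = recall_at_k(ground, aerial, k=1)   # only the third query is correct
r_at_2 = recall_at_k(ground, aerial, k=2)   # every true match is within the top 2
```

In cross-area evaluation the same computation would apply, with the embeddings produced by a model trained on one dataset and the query and reference images drawn from another.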

Paper Structure (45 sections, 7 equations, 10 figures, 13 tables)

This paper contains 45 sections, 7 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Comparison of R@$1$ accuracy among four recently published CVGL methods (SAFA, DSM, TransGeo, and L2LTR), our preliminary work GeoDTR, and the proposed GeoDTR+, in terms of same-area performance (x axis) and cross-area performance (y axis). Our preliminary work GeoDTR achieves SOTA same-area and cross-area performance. Building on GeoDTR, the proposed GeoDTR+ improves results further, especially in cross-area performance.
  • Figure 2: Overview of the proposed GeoDTR+ pipeline. Contrastive Hard Samples Generation (CHSG) first samples an aerial-ground pair $P_{o}$ from the training dataset and generates hard samples $P_{\gamma}$ and $P_{\delta}$. The proposed Geometric Layout Extractor (GLE) predicts the layout descriptors $\mathbf{q}^{a(g)}$ from the raw feature $\mathbf{r}^{a(g)}$. The latent representation $f^{a(g)}$ is obtained as the Frobenius product between $\mathbf{r}^{a(g)}$ and $\mathbf{q}^{a(g)}$. The proposed counterfactual learning provides an auxiliary supervision signal for training the model.
  • Figure 3: Comparison between the previous GLE and the proposed GLE. (a) The GLE from our previous work, GeoDTR. (b) The enhanced GLE in GeoDTR+.
  • Figure 4: (a) Illustration of the layout simulation. The left column shows aerial images and the right column shows ground images. The yellow arrows and lines indicate the north direction. (b) Three randomly sampled contrastive pairs from the CVUSA dataset. (c) Illustration of the proposed counterfactual learning scheme. The arrows indicate the causal relation between two variables. The predicted feature $f^v$ and the imaginary feature $\hat{f}^v$ are pushed away from each other by $L_{cf}$ (the dashed arrow), which provides weak supervision on the raw feature $r^v$ and the geometric layout descriptors $q^v$ so that they capture more distinctive geometric clues.
  • Figure 5: Six sample aerial-ground pairs from our training data. The top two pairs are from the CVUSA dataset, the middle two from the CVACT dataset, and the bottom two from the VIGOR dataset.
  • ...and 5 more figures
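
The Frobenius product mentioned in the Figure 2 caption can be illustrated with a short sketch. This is not the authors' implementation. The tensor shapes are assumptions chosen for illustration: the raw feature is taken as `r` with shape (C, H, W), the layout descriptors as `q` with shape (K, H, W), and each output entry is the Frobenius inner product of one descriptor map with one feature channel. The function name `aggregate` and all sizes are hypothetical.

```python
import numpy as np

def aggregate(r, q):
    """Combine raw features with layout descriptors via Frobenius inner products.

    r: raw feature map, assumed shape (C, H, W)
    q: layout descriptors, assumed shape (K, H, W)
    Returns f with shape (K, C), where f[k, c] = <q[k], r[c]>_F, i.e. the
    sum over all spatial positions of q[k] * r[c].
    """
    return np.einsum("khw,chw->kc", q, r)

rng = np.random.default_rng(0)
C, K, H, W = 4, 2, 3, 5              # illustrative sizes only
r = rng.normal(size=(C, H, W))
q = rng.normal(size=(K, H, W))
f = aggregate(r, q)                  # latent representation, shape (K, C)
```

Under these assumptions, each descriptor acts as a spatial weighting over the feature map, so the layout information in `q` decides which positions of `r` contribute to the final representation.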