Learning Cross-view Visual Geo-localization without Ground Truth

Haoyuan Li, Chang Xu, Wen Yang, Huai Yu, Gui-Song Xia

TL;DR

This work tackles CVGL without ground-truth supervision by freezing a Foundation Model and training a lightweight adapter through a self-supervised pipeline. It introduces an EM-based Pseudo-Labeling (EMPL) module to infer cross-view correspondences from unlabeled data and an Adaptation Information Consistency (AIC) module to preserve the FM's robustness while bridging view gaps. Through experiments on University-1652, University-160k, CVUSA, and CVACT, the method yields substantial gains over pure FM generalization and competitive accuracy versus supervised baselines, with far fewer trainable parameters. The approach also boosts performance of task-specific pre-trained models on new datasets, underscoring its broad applicability and practicality for real-world, label-scarce CVGL deployment.

Abstract

Cross-View Geo-Localization (CVGL) involves determining the geographical location of a query image by matching it with a corresponding GPS-tagged reference image. Current state-of-the-art methods predominantly rely on training models with labeled paired images, incurring substantial annotation costs and training burdens. In this study, we investigate the adaptation of frozen models for CVGL without requiring ground truth pair labels. We observe that training on unlabeled cross-view images presents significant challenges, including the need to establish relationships within unlabeled data and reconcile view discrepancies between uncertain queries and references. To address these challenges, we propose a self-supervised learning framework to train a learnable adapter for a frozen Foundation Model (FM). This adapter is designed to map feature distributions from diverse views into a uniform space using unlabeled data exclusively. To establish relationships within unlabeled data, we introduce an Expectation-Maximization-based Pseudo-labeling module, which iteratively estimates associations between cross-view features and optimizes the adapter. To maintain the robustness of the FM's representation, we incorporate an information consistency module with a reconstruction loss, ensuring that adapted features retain strong discriminative ability across views. Experimental results demonstrate that our proposed method achieves significant improvements over vanilla FMs and competitive accuracy compared to supervised methods, while necessitating fewer training parameters and relying solely on unlabeled data. Evaluation of our adaptation for task-specific models further highlights its broad applicability.
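
The alternation described in the abstract (estimate cross-view associations, then optimize the adapter) can be illustrated with a small NumPy sketch. This is not the authors' implementation, and every specific in it is an assumption made for illustration: the adapter is a single linear map `W`, the E-step uses a mutual-nearest-neighbor rule, the M-step minimizes a squared distance between pseudo-paired features instead of the paper's actual objective, and the term weighted by `lam` is only a simple stand-in for the information consistency module (it keeps adapted features close to the frozen ones). The function names `e_step_mutual_nn`, `m_step`, `loss_fn`, and `adapt` are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def e_step_mutual_nn(zq, zr):
    """E-step (assumed rule): keep query/reference pairs that are mutual nearest neighbors."""
    sim = l2_normalize(zq) @ l2_normalize(zr).T
    q2r = sim.argmax(axis=1)            # best reference for each query
    r2q = sim.argmax(axis=0)            # best query for each reference
    qi = np.arange(zq.shape[0])
    keep = r2q[q2r] == qi               # the match must hold in both directions
    return qi[keep], q2r[keep]

def loss_fn(W, fq, fr, qi, rj, lam):
    """Alignment loss on pseudo-pairs plus a consistency penalty (both are stand-ins)."""
    diff = (fq[qi] - fr[rj]) @ W
    f_all = np.concatenate([fq, fr], axis=0)
    rec = f_all @ W - f_all
    return float((diff ** 2).sum(axis=1).mean() + lam * (rec ** 2).sum(axis=1).mean())

def m_step(W, fq, fr, qi, rj, lr=0.05, lam=0.1):
    """M-step: one gradient-descent step on loss_fn with the pseudo-labels held fixed."""
    d = fq[qi] - fr[rj]
    f_all = np.concatenate([fq, fr], axis=0)
    grad_align = 2.0 * d.T @ (d @ W) / len(qi)
    grad_rec = 2.0 * lam * f_all.T @ (f_all @ W - f_all) / len(f_all)
    return W - lr * (grad_align + grad_rec)

def adapt(fq, fr, n_iters=10):
    """Alternate E- and M-steps on frozen features fq (queries) and fr (references)."""
    W = np.eye(fq.shape[1])             # start from the identity: adapted == frozen
    for _ in range(n_iters):
        qi, rj = e_step_mutual_nn(fq @ W, fr @ W)
        if len(qi) == 0:
            break
        W = m_step(W, fq, fr, qi, rj)
    return W
```

Only `W` is updated here; the inputs `fq` and `fr` play the role of features already extracted by the frozen model, which mirrors the frozen-backbone, trainable-adapter split described above.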

Paper Structure

This paper contains 24 sections, 8 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Illustration of the previous supervised paradigm (a) and the proposed self-supervised paradigm (b). $b_0$ denotes the backbone, $f_0$ denotes the frozen model, and $f_\theta$ denotes the learnable adapter.
  • Figure 2: Performance degradation of the foundation model due to the view gap. We perform drone-to-drone (single-view) and drone-to-satellite (cross-view) retrieval using the frozen foundation model. The blue curve represents the retrieval accuracy for single-view retrieval, while the red curve shows the lower accuracy for cross-view retrieval. More experimental details are presented in Section \ref{sec:ablation}.
  • Figure 3: Overview of the self-supervised cross-view adaptation. In the training phase, the foundation model is frozen and the adapter is trained via the proposed EMPL and AIC modules without ground truth. In the inference phase, the global features of the input images are extracted by the frozen foundation model and the trained adapter, then used for retrieval to produce the final geo-localization.
  • Figure 4: Workflow of the EMPL module. The E-step pseudo-labels the positive pairs, while the M-step updates the adapter under the supervision of the pseudo-labels.
  • Figure 5: (a) Contrastive learning compels $Z_X$ to extract mutual information and discard irrelevant information. (b) If the matched target $Y$ is unknown, $Z$ may lose mutual information and become less discriminative. (c) Our goal is to extract mutual information while preserving some redundant information for enhanced robustness.
  • ...and 8 more figures
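
The retrieval step mentioned in the captions of Figures 2 and 3 (match each query feature against the reference features and measure retrieval accuracy) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's evaluation code: cosine similarity is assumed as the matching score, the function name `recall_at_k` and its parameters are hypothetical, and the ground-truth indices in `gt_index` are used only to score retrieval, never for training.

```python
import numpy as np

def recall_at_k(query_feats, ref_feats, gt_index, k=1):
    """Fraction of queries whose true reference appears among the top-k retrieved items."""
    # Normalize rows so the dot product equals cosine similarity (assumed metric).
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sim = q @ r.T
    # Indices of the k most similar references for each query, best first.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == np.asarray(gt_index)[:, None]).any(axis=1)
    return float(hits.mean())
```

Running the same function once with drone features on both sides and once with drone queries against satellite references would give the kind of single-view versus cross-view comparison that Figure 2 describes.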