VFM-Loc: Zero-Shot Cross-View Geo-Localization via Aligning Discriminative Visual Hierarchies

Jun Lu; Zehao Sang; Haoqi Wei; Xiangyun Liu; Kun Zhu; Haitao Guo; Zhihui Gong; Lei Ding

Abstract

Cross-View Geo-Localization (CVGL) in remote sensing aims to locate a drone-view query by matching it to geo-tagged satellite images. Although supervised methods have achieved strong results on closed-set benchmarks, they often fail to generalize to unconstrained, real-world scenarios due to severe viewpoint differences and dataset bias. To overcome these limitations, we present VFM-Loc, a training-free framework for zero-shot CVGL that leverages the generalizable visual representations of vision foundation models (VFMs). VFM-Loc identifies and matches discriminative visual clues across different viewpoints through a progressive alignment strategy. First, we design a hierarchical clue extraction mechanism using Generalized Mean (GeM) pooling and Scale-Weighted R-MAC to preserve distinctive visual clues across scales while maintaining hierarchical confidence. Second, we introduce a statistical manifold alignment pipeline based on domain-wise PCA and Orthogonal Procrustes analysis, which linearly aligns heterogeneous feature distributions in a shared metric space. Experiments demonstrate that VFM-Loc achieves strong zero-shot accuracy on standard benchmarks and surpasses supervised methods by over 20% in Recall@1 on the challenging LO-UCV dataset with large oblique angles. This work highlights that principled alignment of pre-trained features can effectively bridge the cross-view gap, establishing a robust and training-free paradigm for real-world CVGL. The code is available at: https://github.com/DingLei14/VFM-Loc.
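
As a rough illustration of the Generalized Mean (GeM) pooling named in the abstract, the sketch below pools a single feature map into one descriptor. It is not taken from the VFM-Loc code: the function name `gem_pool`, the `(C, H, W)` layout, the exponent `p=3.0`, and the clipping constant `eps` are all assumptions chosen for illustration.

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized Mean pooling over the spatial axes.

    feature_map: array of shape (C, H, W); returns a vector of shape (C,).
    p = 1 reduces to average pooling; large p approaches max pooling.
    The default p and eps are illustrative assumptions, not paper values.
    """
    # Clip to a small positive value so the fractional power is well defined.
    x = np.clip(feature_map, eps, None)
    return np.mean(x ** p, axis=(1, 2)) ** (1.0 / p)

# Hypothetical usage on a random non-negative feature map.
rng = np.random.default_rng(0)
fmap = rng.random((8, 4, 4))
descriptor = gem_pool(fmap)
print(descriptor.shape)
```

Raising `p` shifts each channel's pooled value from its spatial mean toward its spatial maximum, which is why GeM is often described as emphasizing the most salient responses.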

Paper Structure (21 sections, 10 equations, 6 figures, 8 tables)

Introduction
Related Work
Cross-View Geo-Localization
Unsupervised & Zero-Shot Learning
Methodology
Problem Formulation and System Overview
Hierarchical Extraction of Discriminative Visual Clues
Statistical Manifold Alignment
Experiments
CVGL Benchmarks and Evaluation Metrics
Ablation Study
Effectiveness of the proposed modules:
Visualization on Manifold Alignment:
Visualization of Similarity Responses
Impact of $\alpha$ in R-MAC:
...and 6 more sections

Figures (6)

Figure 1: Key challenges of zero-shot CVGL: disparity in feature distribution. Drone and satellite visual manifolds suffer severe misalignment due to inherently different viewing geometry, thus direct zero-shot matching fails under viewpoint discrepancies.
Figure 2: The proposed VFM-Loc, a zero-shot CVGL framework, operates in three training-free stages: (1) Feature extraction using a VFM; (2) Hierarchical discriminative clue extraction via GeM pooling and scale-weighted R-MAC; (3) Statistical manifold alignment based on domain-wise PCA and orthogonal Procrustes rotation, projecting heterogeneous features into a unified embedding space for retrieval.
Figure 3: Sample image pairs from the constructed LO-UCV dataset. Top: satellite images; Bottom: high-resolution images from fixed-wing UAVs with large tilt (>45°).
Figure 4: t-SNE visualization of cross-view features before and after manifold alignment.
Figure 5: Similarity response visualization before and after manifold alignment.
...and 1 more figure
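
Stage (3) in the Figure 2 caption, domain-wise PCA followed by an orthogonal Procrustes rotation, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the helper names (`pca_project`, `procrustes_rotation`, `align`) are hypothetical, and it assumes that row i of the drone features corresponds to row i of the satellite features. How VFM-Loc obtains such correspondences without training is not described in this section.

```python
import numpy as np

def pca_project(X, k):
    """Center X (n_samples, dim) and project it onto its top-k principal axes."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def procrustes_rotation(A, B):
    """Orthogonal R minimizing ||A @ R - B||_F for row-paired A, B of shape (n, k)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def align(drone_feats, sat_feats, k):
    """PCA each domain separately, then rotate drone features onto satellite ones.

    Assumption (not from the source): rows of the two inputs are paired.
    """
    A = pca_project(drone_feats, k)
    B = pca_project(sat_feats, k)
    R = procrustes_rotation(A, B)
    return A @ R, B

# Hypothetical usage with random stand-in features.
rng = np.random.default_rng(0)
drone_aligned, sat_reduced = align(rng.standard_normal((30, 16)),
                                   rng.standard_normal((30, 16)), k=4)
print(drone_aligned.shape, sat_reduced.shape)
```

Because R is constrained to be orthogonal, the rotation preserves distances and angles within the drone feature set, so the alignment changes only how the two domains sit relative to each other.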
