
Zero-shot Vision-Language Reranking for Cross-View Geolocalization

Yunus Talha Erzurumlu, John E. Anderson, William J. Shuart, Charles Toth, Alper Yilmaz

Abstract

Cross-view geolocalization (CVGL) systems, while effective at retrieving a list of relevant candidates (high Recall@k), often fail to identify the single best match (low Top-1 accuracy). This work investigates the use of zero-shot Vision-Language Models (VLMs) as rerankers to address this gap. We propose a two-stage framework: state-of-the-art (SOTA) retrieval followed by VLM reranking. We systematically compare two strategies: (1) Pointwise (scoring candidates individually) and (2) Pairwise (comparing candidates relatively). Experiments on the VIGOR dataset show a clear divergence: all pointwise methods either cause a catastrophic drop in performance or leave it unchanged. In contrast, a pairwise comparison strategy using LLaVA improves Top-1 accuracy over the strong retrieval baseline. Our analysis concludes that these VLMs are poorly calibrated for absolute relevance scoring but are effective at fine-grained relative visual judgment, making pairwise reranking a promising direction for enhancing CVGL precision.
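The pairwise strategy the abstract describes can be sketched as a simple champion-style pass over the retrieved candidates: the current best aerial image is compared against each remaining one, and whichever the VLM prefers survives. The sketch below is illustrative only; `vlm_prefers` is a hypothetical stand-in for a zero-shot VLM (e.g., LLaVA) prompted to judge which of two aerial candidates better matches the ground-level query, and the exact procedure in the paper may differ.

```python
def pairwise_rerank(candidates, vlm_prefers):
    """Promote a single winner via sequential pairwise comparisons.

    candidates: top-K aerial images from the Stage-1 retrieval model,
                in their original retrieval order.
    vlm_prefers(a, b): hypothetical callable returning True if the VLM
                judges candidate `a` a better match than `b`.
    """
    best = candidates[0]
    for challenger in candidates[1:]:
        if vlm_prefers(challenger, best):
            best = challenger
    # Move the winner to rank 1; preserve the retrieval order otherwise.
    rest = [c for c in candidates if c != best]
    return [best] + rest

# Toy comparator with a hidden "match" score, mimicking a well-behaved
# relative judgment (real VLM outputs would come from prompted inference).
scores = {"A": 0.2, "B": 0.9, "C": 0.5}
reranked = pairwise_rerank(["A", "C", "B"],
                           lambda x, y: scores[x] > scores[y])
print(reranked)  # → ['B', 'A', 'C']
```

Note that this relative formulation never asks the model for an absolute score, which is exactly the capability the pointwise experiments found to be poorly calibrated.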

Paper Structure

This paper contains 21 sections, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Example of the cross-view geolocalization challenge. a. Ground-level query images. b. Several aerial image candidates, only one of which is the correct match, demonstrating the difficulty caused by viewpoint and appearance differences.
  • Figure 2: The proposed two-stage framework. Stage 1 uses a SOTA retrieval model to obtain the top-$K$ candidates. Stage 2 uses a VLM reranker to produce the final ranked list.
  • Figure 3: Analysis of pointwise reranking failures. Plots (a–b) show score distributions for direct score prediction, (c–d) for Likert-scale prediction, (e–f) for Yes/No prediction, and (g–h) for Yes/No prediction with explicit reasoning. In all cases, the distributions for correct and incorrect candidates strongly overlap, indicating that the VLM-assigned scores (whether direct, Likert, or Yes/No) do not provide a clear, separable signal to distinguish the true match from other plausible-but-incorrect candidates.
  • Figure 4: Main recall performance comparison. This plot shows Recall@1, Recall@3, and Recall@5 for the baseline retrieval model and all VLM reranking strategies, sorted by R@1 performance. LLaVA Pairwise (64.80% R@1) is the only method to achieve a significant improvement over the Baseline (61.20% R@1). In contrast, all pointwise methods (Direct, Likert, Yes/No) either cause a catastrophic drop in performance or leave it unchanged.
  • Figure 5: Qualitative results comparing the base model's ranking and the VLM's final selection. In each example, the satellite images are displayed in the order originally produced by the base model, not in the reranked order. The blue-outlined image indicates the VLM's top-1 selection among these candidates.