GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

Zixuan Song, Jing Zhang, Di Wang, Zidie Zhou, Wenbin Liu, Haonan Guo, En Wang, Bo Du

TL;DR

GeoBridge introduces a semantic-anchored, multi-view foundation model for cross-view and cross-modal geo-localization that performs bidirectional matching across drone, street-view, and satellite imagery and supports language-to-image retrieval. It shifts away from satellite-centric localization by using a shared semantic bridge to align three visual views and their text descriptions, enabling robust localization even when satellite data are unavailable. The GeoLoc dataset, with 52,679 tri-view triplets across 36 countries and unified textual descriptors, underpins pre-training and yields strong gains in both cross-view and cross-modal tasks, demonstrating broad generalization and transfer. Overall, the approach offers improved localization accuracy, cross-domain robustness, and practical applicability for UAV navigation, disaster response, and multi-sensor fusion, with public release of data and code to support reproducibility.

Abstract

Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It also underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. To support this task, we construct GeoLoc, the first large-scale, cross-modal, multi-view aligned dataset, comprising over 50,000 triplets of drone, street-view panorama, and satellite images together with their textual descriptions, collected from 36 countries and aligned both geographically and semantically. Extensive evaluations across multiple tasks confirm that GeoLoc pre-training markedly improves the geo-localization accuracy of GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models are available at https://github.com/MiliLab/GeoBridge.
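The retrieval formulation described in the abstract can be illustrated with a minimal sketch. It assumes that every query and every geo-tagged reference has already been mapped to an embedding vector by some encoder; the toy vectors, the location ids, and the helper names (`retrieve_top_k`, `recall_at_k`) are hypothetical and are not taken from the GeoBridge code base. Cosine similarity and Recall@K are used here only as a common choice for retrieval, not as a statement of the authors' exact setup.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve_top_k(query, gallery, k=3):
    """Rank gallery entries (location_id, embedding) by similarity to the query
    and return the ids of the k best matches."""
    ranked = sorted(
        gallery,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [loc_id for loc_id, _ in ranked[:k]]


def recall_at_k(queries, gallery, k=1):
    """Fraction of queries whose true location id appears in the top-k results."""
    hits = sum(
        1 for true_id, emb in queries if true_id in retrieve_top_k(emb, gallery, k)
    )
    return hits / len(queries)


# Hypothetical toy data: a reference gallery (e.g. satellite or drone embeddings)
# and queries from another view or from text, all in the same embedding space.
gallery = [
    ("loc_a", [1.0, 0.0, 0.0]),
    ("loc_b", [0.0, 1.0, 0.0]),
    ("loc_c", [0.0, 0.0, 1.0]),
]
queries = [
    ("loc_a", [0.9, 0.1, 0.0]),
    ("loc_b", [0.2, 0.8, 0.1]),
    ("loc_c", [0.6, 0.0, 0.5]),  # deliberately ambiguous: closer to loc_a
]

print(retrieve_top_k(queries[0][1], gallery, k=1))
print(recall_at_k(queries, gallery, k=1))
print(recall_at_k(queries, gallery, k=2))
```

Because the ranking depends only on embeddings in a shared space, the same routine applies whichever view or modality produces the query, which is what makes bidirectional and language-to-image retrieval possible in this formulation.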

Paper Structure

This paper contains 36 sections, 6 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Overall workflow. Left: multi-view data processing for the GeoLoc dataset. Right: the GeoBridge method. (a) Global distribution of multi-view image groups (darker shades indicate higher density). (b) Counts of drone images per ground-footprint bin as a function of covered area (m$^2$).
  • Figure 2: Qualitative image retrieval results on the GeoLoc dataset. Red boxes indicate correctly matched images.
  • Figure 3: Qualitative results for cross-modal geo-localization. Street-view descriptions are used as queries to retrieve drone images; the top three results are shown, with red boxes indicating correct matches.
  • Figure 4: Examples of original drone images.
  • Figure 5: Examples of basic validity screening.
  • ...and 11 more figures