Scaling Image Geo-Localization to Continent Level

Philipp Lindenberger, Paul-Edouard Sarlin, Jan Hosang, Matteo Balice, Marc Pollefeys, Simon Lynen, Eduard Trulls

TL;DR

This work tackles fine-grained image geolocalization at continental scales, addressing the challenge of achieving meter-level accuracy without relying on dense ground-truth priors. It introduces a hybrid framework that learns rich ground-view feature prototypes via a proxy classification task and fuses them with aerial embeddings in per-cell codes, enabling scalable and precise cross-view retrieval across vast regions. By training with a triad of embeddings (ground, aerial, prototypes) under a multi-similarity loss and interpolating cell boundaries, the approach achieves strong continent-wide performance, demonstrates cross-area and cross-domain generalization, and shows robustness to data sparsity and viewpoint changes. The extensive experiments on Western Europe (BEDENL and EuropeWest) establish substantial gains over baselines, with effective scalability to millions of cells and practical implications for geolocation, navigation, and safety applications, while also acknowledging potential privacy concerns and the need for responsible deployment.

Abstract

Determining the precise geographic location of an image at a global scale remains an unsolved challenge. Standard image retrieval techniques are inefficient due to the sheer volume of images (>100M) and fail when coverage is insufficient. Scalable solutions, however, involve a trade-off: global classification typically yields coarse results (10+ kilometers), while cross-view retrieval between ground and aerial imagery suffers from a domain gap and has been primarily studied on smaller regions. This paper introduces a hybrid approach that achieves fine-grained geo-localization across a large geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. We combine these learned prototypes with embeddings of aerial imagery to increase robustness to the sparsity of ground-level data. This enables direct, fine-grained retrieval over areas spanning multiple countries. Our extensive evaluation demonstrates that our approach can localize more than 68% of queries within 200m on a dataset covering a large part of Europe. The code is publicly available at https://scaling-geoloc.github.io.

Paper Structure

This paper contains 55 sections, 2 equations, 18 figures, 13 tables.

Figures (18)

  • Figure 1: Large-scale fine-grained geolocalization. We introduce an approach that can localize a ground-level image within 100m at the scale of a continent (here Western Europe) by combining the scalability and robustness of classification with the precision of cross-view ground-aerial retrieval. All images shown here are misregistered by either paradigm, but correctly localized by ours.
  • Figure 2: Localization process. The prototypes $\bm{\mathrm{z}}^{\mathrm{P}}$ are extracted from the model weights $\Phi$ and upsampled to the target resolution using the S2Cell hierarchy. Aerial tiles roughly covering the cell are encoded using the aerial encoder $\Phi^{\mathrm{A}}$ and concatenated. Both databases are combined per-cell using the calibration factor $\kappa$, resulting in the final database of cell codes $\bm{\mathrm{z}}^{\mathrm{cell}}$. During inference (right), we extract the embedding of a query image $\bm{\mathrm{z}}^{\mathrm{Q}}$ with $\Phi^{\mathrm{G}}$ and we compute the similarity to all cell codes $\{ \bm{\mathrm{z}}^{\mathrm{cell}}_j {}^{\top} \bm{\mathrm{z}}^{\mathrm{Q}} \}$. The estimated location is the cell with the highest similarity.
  • Figure 3: Supervision: We train query, aerial, and prototype embeddings, $\bm{\mathrm{z}}^{\mathrm{Q}}$, $\bm{\mathrm{z}}^{\mathrm{A}}$ and $\bm{\mathrm{z}}^{\mathrm{P}}$, to be similar for corresponding locations and different otherwise. We interpolate prototypes to account for the coarseness of their cells.
  • Figure 4: Impact of the density of the training data. We slice the recall@$K$@200m on EuropeWest (Table \ref{tbl:europe-and-cross-area}) by the temporal (left) and spatial (right) density of StreetView images within $L$=15 cells. We compare our full (hybrid) model (blue) with one relying only on ground-level images (red). The aerial embeddings help improve the accuracy especially when ground-level data is sparse.
  • Figure 5: Left: PCA visualization of the learned prototypes, which appear in different colors for e.g., urban, forested, or coastal areas. The high-frequency noise suggests that they also encode local distinctive information. Right: Test queries that are successfully localized (${\color{ForestGreen}\bullet}$) are uniformly distributed over the map, while failures (${\color{red}\bullet}$) are prevalent in rural areas, where training data is sparser.
  • ...and 13 more figures
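
The retrieval step described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It builds per-cell codes from prototype and aerial embeddings, then returns the cells most similar to a query embedding. The caption does not specify how the calibration factor $\kappa$ combines the two databases, so the weighted sum followed by L2 normalization below is an assumption, and the function names `build_cell_codes` and `localize` are hypothetical rather than taken from the authors' code.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def build_cell_codes(prototypes, aerial, kappa):
    """Combine per-cell prototype and aerial embeddings into cell codes.

    ASSUMPTION: the combination is a kappa-weighted sum of the normalized
    embeddings; the source only states that kappa calibrates the two.
    prototypes, aerial: arrays of shape (num_cells, dim).
    """
    combined = l2_normalize(prototypes) + kappa * l2_normalize(aerial)
    return l2_normalize(combined)


def localize(query, cell_codes, top_k=1):
    """Return indices and scores of the top_k cells by dot-product similarity."""
    scores = cell_codes @ l2_normalize(query)
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_cells, dim = 5, 8
    # Random stand-ins for learned prototypes and encoded aerial tiles.
    z_p = rng.normal(size=(num_cells, dim))
    z_a = rng.normal(size=(num_cells, dim))
    z_cell = build_cell_codes(z_p, z_a, kappa=0.5)

    # A synthetic query lying close to the code of cell 3.
    z_q = z_cell[3] + 0.01 * rng.normal(size=dim)
    cells, sims = localize(z_q, z_cell, top_k=3)
    print("ranked cells:", cells, "similarities:", sims)
```

In this toy setup the query is constructed near one cell code, so that cell should rank first. At the scale of millions of cells mentioned in the TL;DR, the exhaustive matrix product would presumably be replaced by an approximate nearest-neighbor index, which this sketch does not cover.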