OpenStreetView-5M: The Many Roads to Global Visual Geolocation

Guillaume Astruc; Nicolas Dufour; Ioannis Siglidis; Constantin Aronssohn; Nacim Bouia; Stephanie Fu; Romain Loiseau; Van Nguyen Nguyen; Charles Raude; Elliot Vincent; Lintao XU; Hongyu Zhou; Loic Landrieu

TL;DR

OpenStreetView-5M (OSV-5M) introduces a global, open-access street-view geolocation dataset with a strict train/test split, so that evaluation measures geographical generalization rather than memorization. The authors benchmark a broad range of image encoders, spatial representations, and training strategies, and show that a carefully combined model, featuring a DataComp-pretrained ViT-L-14 backbone, QuadTree-based hierarchical and hybrid supervision, and region-contrastive fine-tuning, achieves substantial gains over baselines. The work provides an evaluation framework (Geoscore, Haversine distance, and accuracy at administrative levels), demonstrates the value of hierarchical and hybrid approaches, and highlights OSV-5M's potential for self-supervised learning and generative modeling. It also includes a dataset datasheet and extensive ablations. By combining global geographic coverage with clean train/test separation and rich metadata, OSV-5M is positioned to support robust geographic representation learning and fair benchmarking across geolocation research.
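The TL;DR names Haversine distance and Geoscore among the evaluation metrics. As a rough illustration, the sketch below computes the great-circle (Haversine) distance between a predicted and a true location, then maps it to a score with a GeoGuessr-style exponential decay. The function names, the Earth radius of 6371 km, and the Geoscore constants (maximum of 5000, decay scale of 1492.7 km) are assumptions made for this example and are not stated on this page; the paper and its repository remain the reference for the exact definitions.

```python
import math

# Assumption: mean Earth radius in kilometres.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def geoscore(distance_km, max_score=5000.0, scale_km=1492.7):
    """Hypothetical Geoscore: exponential decay of the score with distance.

    The default constants are assumptions for illustration only.
    """
    return max_score * math.exp(-distance_km / scale_km)


if __name__ == "__main__":
    # Example: prediction in Paris, ground truth in Berlin.
    d = haversine_km(48.8566, 2.3522, 52.5200, 13.4050)
    print(f"distance: {d:.1f} km, score: {geoscore(d):.1f}")
```

Under these assumptions, a perfect prediction scores 5000 and the score decays smoothly toward 0 as the error grows, which makes it less sensitive to a few very distant outliers than the mean Haversine distance.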

Abstract

Determining the location of an image anywhere on Earth is a complex visual task, which makes it particularly relevant for evaluating computer vision algorithms. Yet, the absence of standard, large-scale, open-access datasets with reliably localizable images has limited its potential. To address this issue, we introduce OpenStreetView-5M, a large-scale, open-access dataset comprising over 5.1 million geo-referenced street view images, covering 225 countries and territories. In contrast to existing benchmarks, we enforce a strict train/test separation, allowing us to evaluate the relevance of learned geographical features beyond mere memorization. To demonstrate the utility of our dataset, we conduct an extensive benchmark of various state-of-the-art image encoders, spatial representations, and training strategies. All associated code and models can be found at https://github.com/gastruc/osv5m.

Paper Structure (56 sections, 1 equation, 14 figures, 9 tables)

This paper contains 56 sections, 1 equation, 14 figures, 9 tables.

Introduction
Related Work
Localizability
Geolocation Datasets
Web-Scraped.
Street View.
Geolocation Methods
Image Retrieval-Based Approaches.
Classification-Based Approaches.
Hybrid Approaches.
OpenStreetView-5M
Benchmark
Evaluation Metrics.
Framework
Implementation details.
...and 41 more sections

Figures (14)

Figure 1: Global Visual Geolocation. Predicting the location of an image taken anywhere in the world from just pixels requires detecting a combination of clues of various abstraction levels [mehta2016exploratory]. Can you guess where these images were taken?
Figure 2: Localizable vs Non-Localizable. Images from our dataset (green) occupy the space between weakly localizable images (red), like the ones from the test set of Im2GPS3k [Im2GPS++YFCC4k+Im2GPS3k], and landmark images used to advertise CV conferences (blue).
Figure 3: OpenStreetView-5M. Image density and proportions per country and continent for the train and test sets. To ensure an unbiased evaluation, we prioritize the uniformity of the test set's distribution across the globe over the training set distribution.
Figure 4: Visual Geolocation Model. We propose a simple and versatile framework for visual geolocation and explore the impact of its various components on train and test performance on OpenStreetView-5M. Starting from the left, the input image is converted to a vector representation by an image encoder $f^{\text{img}}$ (red). A geolocation head $f^{\text{loc}}$ then maps this vector to a set of geographical predictions (mint). A contrastive objective can optionally be added (cyan), as well as auxiliary targets to learn better representations for geolocation (lilac). We also consider various parameter fine-tuning strategies for training our image encoder, by freezing all or part of $f^{\text{img}}$ (yellow).
Figure 5: Spatial Distribution of Errors. We plot the average prediction error of the combined model in km across the globe.
...and 9 more figures
