Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Nicolas Dufour, David Picard, Vicky Kalogeiton, Loic Landrieu

TL;DR

This work introduces a generative framework for global visual geolocation that respects Earth's spherical geometry by leveraging diffusion in $\mathbb{R}^3$ and Riemannian flow matching on $\mathcal{S}^2$. By training a conditioned denoiser $\psi$ to predict noise or velocity fields, the model generates location trajectories whose endpoints provide location estimates and full conditional densities $p(y\mid c)$. The approach yields state-of-the-art results on OSV-5M, iNat21, and YFCC4k, and enables probabilistic visual geolocation with calibrated metrics such as NLL, localizability, density, and coverage. The paper also introduces classifier-free guidance to sharpen distributions and provides detailed implementation and theoretical notes on spherical geometry and density estimation on manifolds, highlighting the practical impact for uncertainty-aware geolocation tasks.

Abstract

Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Codes and models will be made available.

Paper Structure

This paper contains 48 sections, 2 theorems, 39 equations, 9 figures, 4 tables.

Key Result

Proposition 1

Given a location $y \in \mathcal{S}^2$ and an image $c$, consider solving the following ordinary differential equation system for $t$ from $0$ to $1$: $\frac{\mathrm{d}x}{\mathrm{d}t} = \psi(x(t) \mid c)$ and $\frac{\mathrm{d}f}{\mathrm{d}t} = -\operatorname{div} \psi(x(t) \mid c)$, with initial conditions $x(0) = y$ and $f(0) = 0$. Then the log-probability density of $y$ given $c$ is $\log p(y \mid c) = \log p_\epsilon(x(1) \mid c) - f(1)$, where $p_\epsilon$ is the known distribution of the pure noise $\epsilon$, and $f(t)$ accumulates the negative divergence of the velocity field.
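As a sanity check on the formula above, here is a minimal sketch in a flat Euclidean toy setting. It does not work on $\mathcal{S}^2$ and is not the authors' implementation. It assumes a system of the form $\dot{x} = v(x)$, $\dot{f} = -\operatorname{div} v(x)$ with $x(0) = y$ and $f(0) = 0$, a hypothetical linear velocity $v(x) = a x$ in $\mathbb{R}^2$, and standard normal noise, a case where the density is known in closed form. All function names are illustrative.

```python
import math

A = 0.7  # hypothetical expansion rate of the toy velocity field v(x) = A * x


def toy_velocity(x):
    """Toy stand-in for a learned velocity field (assumption, not the paper's model)."""
    return [A * xi for xi in x]


def toy_divergence(x):
    """Divergence of v(x) = A * x in d dimensions is A * d."""
    return A * len(x)


def standard_normal_logpdf(x):
    """Log-density of the noise distribution, here an isotropic standard normal."""
    d = len(x)
    return -0.5 * d * math.log(2.0 * math.pi) - 0.5 * sum(xi * xi for xi in x)


def integrate_log_density(y, velocity, divergence, log_noise_density, steps=1000):
    """Integrate dx/dt = v(x), df/dt = -div v(x) from t=0 to t=1 with classical RK4.

    Returns log p(y) = log p_noise(x(1)) - f(1).
    """
    x = list(y)
    f = 0.0
    h = 1.0 / steps
    for _ in range(steps):
        k1x, k1f = velocity(x), -divergence(x)
        x2 = [xi + 0.5 * h * ki for xi, ki in zip(x, k1x)]
        k2x, k2f = velocity(x2), -divergence(x2)
        x3 = [xi + 0.5 * h * ki for xi, ki in zip(x, k2x)]
        k3x, k3f = velocity(x3), -divergence(x3)
        x4 = [xi + h * ki for xi, ki in zip(x, k3x)]
        k4x, k4f = velocity(x4), -divergence(x4)
        x = [
            xi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1x, k2x, k3x, k4x)
        ]
        f += h / 6.0 * (k1f + 2.0 * k2f + 2.0 * k3f + k4f)
    return log_noise_density(x) - f


def closed_form_log_density(y):
    """Exact log-density for the toy case: y = exp(-A) * eps with eps ~ N(0, I)."""
    d = len(y)
    sq = sum(yi * yi for yi in y)
    return -0.5 * d * math.log(2.0 * math.pi) + A * d - 0.5 * math.exp(2.0 * A) * sq


y = [0.3, -0.2]
estimate = integrate_log_density(y, toy_velocity, toy_divergence, standard_normal_logpdf)
print(estimate, closed_form_log_density(y))
```

In this toy case the ODE estimate and the closed form should agree up to integration error. On the sphere, the velocity would come from the trained network and the divergence would be the Riemannian one, so this sketch only illustrates the bookkeeping of $x(t)$ and $f(t)$.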

Figures (9)

  • Figure 1: Geolocation as a Generative Process. We explore diffusion and flow matching for visual geolocation by sampling and denoising random locations. This process generates trajectories on the Earth's surface, whose endpoints provide location estimates. Our models also provide probability densities over all possible image locations. We illustrate these trajectories and the log-densities for three images from different datasets: an Andean condor from iNat21 [van2021benchmarking], an African open-air market from YFCC-100M [YFCC], and a dashcam snapshot from OSV-5M [astruc2024openstreetview]. Predicted and true image locations are marked with distinct symbols.
  • Figure 2: Generative Framework. We implement three generative approaches for geolocation: diffusion in $\mathbb{R}^3$, flow matching in $\mathbb{R}^3$, and Riemannian flow matching directly on $\mathcal{S}^2$. This figure provides the formulas for the noising processes and the loss functions for each approach.
  • Figure 3: Inference Pipeline. We start by embedding the image to be localized into a vector using a frozen image encoder. We then sample random noise $\epsilon$ in $\mathbb{R}^3$ or on $\mathcal{S}^2$, projected here onto the sphere. We iteratively remove the noise using either the reverse diffusion or flow-matching equations from $t=1$ to $0$. The final point of this trajectory is our predicted location. Additionally, our model can be queried to predict a probability density at any point on the sphere by solving an Ordinary Differential Equation (ODE) system.
  • Figure 4: Scheduler. We chose a noise scheduler that assigns more weight to the beginning of the diffusion process.
  • Figure 5: Impact of Number of Timesteps. We report several metrics on OpenStreetView-5M for different numbers of timesteps with the Riemannian flow matching model.
  • ...and 4 more figures
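
The inference pipeline described in Figure 3 can be sketched as follows for the Riemannian flow matching case. This is a minimal illustration under stated assumptions, not the authors' code. The trained, image-conditioned network is replaced by a hypothetical `toy_velocity` that already knows the target location, the integrator is a plain Euler scheme using the sphere's exponential map, and all helper names are invented for this example.

```python
import math
import random


def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def latlon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a unit vector in R^3."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return [math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat)]


def xyz_to_latlon(p):
    """Map a point in R^3 back to latitude/longitude in degrees."""
    x, y, z = normalize(p)
    return math.degrees(math.asin(max(-1.0, min(1.0, z)))), math.degrees(math.atan2(y, x))


def sphere_log(x, y):
    """Tangent vector at x pointing toward y, with length equal to the geodesic distance."""
    c = max(-1.0, min(1.0, sum(a * b for a, b in zip(x, y))))
    tangent = [b - c * a for a, b in zip(x, y)]
    n = math.sqrt(sum(t * t for t in tangent))
    if n < 1e-12:
        return [0.0, 0.0, 0.0]
    theta = math.acos(c)
    return [theta * t / n for t in tangent]


def sphere_exp(x, v):
    """Move from x along the geodesic with initial tangent velocity v."""
    n = math.sqrt(sum(a * a for a in v))
    if n < 1e-12:
        return list(x)
    return normalize([math.cos(n) * a + math.sin(n) * b / n for a, b in zip(x, v)])


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the trained network psi(x_t | c).

    It returns the velocity of the geodesic path that sits at `target` for t=0
    and at the noise sample for t=1. A real model would not see `target`.
    """
    return [-g / t for g in sphere_log(x, target)]


def sample_location(target, steps=80, seed=0):
    """Start from uniform noise on the sphere and integrate from t=1 down to t=0."""
    rng = random.Random(seed)
    x = normalize([rng.gauss(0.0, 1.0) for _ in range(3)])  # uniform sample on S^2
    dt = 1.0 / steps
    for k in range(steps, 0, -1):
        t = k * dt
        v = toy_velocity(x, t, target)
        x = sphere_exp(x, [-dt * vi for vi in v])  # Euler step backward in time
    return x


target = latlon_to_xyz(48.8566, 2.3522)  # illustrative target coordinates
print(xyz_to_latlon(sample_location(target)))
```

Because the toy velocity follows a single geodesic, the trajectory ends at the target regardless of the starting noise. With a learned network, the endpoint would instead depend on the image embedding and on the sampled noise, which is what makes repeated sampling informative about ambiguity.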

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • proof