LocDiff: Identifying Locations on Earth by Diffusing in the Hilbert Space

Zhangyu Wang, Zeping Liu, Jielu Zhang, Zhongliang Zhou, Qian Cao, Nemin Wu, Lan Mu, Yang Song, Yiqun Xie, Ni Lao, Gengchen Mai

TL;DR

LocDiff introduces a dense, multi-scale latent diffusion framework for image geolocalization by embedding spherical locations into an SHDD space built from Spherical Harmonics Dirac Delta functions. A CS-UNet learns the conditional backward diffusion in this space, with SHDD-KL as a stable training objective and a learning-free, mode-seeking SHDD Decoder that maps representations back to geolocations. The approach achieves state-of-the-art results across five global datasets and generalizes better to unseen locations; a hybrid variant, LocDiff-H, additionally leverages retrieval for fine-scale accuracy. The method generates locations without grids or galleries and has the potential to improve real-world geo-context tasks through dense, scalable location modeling and efficient inference.

Abstract

Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. State-of-the-art methods employ either grid-based classification or gallery-based image-location retrieval, whose spatial generalizability significantly suffers if the spatial distribution of test images does not align with the choices of grids and galleries. Recently emerging generative approaches, while getting rid of grids and galleries, use raw geographical coordinates and suffer quality losses due to their lack of multi-scale information. To address these limitations, we propose a multi-scale latent diffusion model called LocDiff for image geolocalization. We developed a novel positional encoding-decoding framework called Spherical Harmonics Dirac Delta (SHDD) Representations, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking on spherical probability distributions. We also propose a novel SirenNet-based architecture (CS-UNet) to learn an image-based conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. To the best of our knowledge, LocDiff is the first image geolocalization model that performs latent diffusion in a multi-scale location encoding space and generates geolocations under the guidance of images. Experimental results show that LocDiff can outperform all state-of-the-art grid-based, retrieval-based, and diffusion-based baselines across 5 challenging global-scale image geolocalization datasets, and demonstrates significantly stronger generalizability to unseen geolocations.
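
The encode/decode idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: real spherical harmonics truncated at degree $L$ are used as the basis, the encoding of a location is taken to be the basis evaluated at that location (the coefficients of the truncated spherical Dirac delta), and a brute-force argmax over a 2-degree latitude/longitude grid stands in for the mode-seeking decoder. The function names `assoc_legendre`, `shdd_basis`, `shdd_encode`, and `shdd_decode` are hypothetical.

```python
import math
import numpy as np

def assoc_legendre(L, x):
    """Associated Legendre values P[l, m] for 0 <= m <= l <= L (upward recurrence)."""
    x = np.asarray(x, dtype=float)
    s = np.sqrt(np.clip(1.0 - x * x, 0.0, None))
    P = np.zeros((L + 1, L + 1) + x.shape)
    P[0, 0] = 1.0
    for m in range(1, L + 1):          # diagonal terms P_m^m
        P[m, m] = -(2 * m - 1) * s * P[m - 1, m - 1]
    for m in range(L):                 # first off-diagonal terms P_{m+1}^m
        P[m + 1, m] = (2 * m + 1) * x * P[m, m]
    for m in range(L + 1):             # remaining terms
        for l in range(m + 2, L + 1):
            P[l, m] = ((2 * l - 1) * x * P[l - 1, m]
                       - (l + m - 1) * P[l - 2, m]) / (l - m)
    return P

def shdd_basis(theta, phi, L):
    """Real spherical harmonics up to degree L; output shape (..., (L + 1) ** 2).

    theta is the polar angle (colatitude), phi the azimuth, both in radians.
    """
    theta, phi = np.broadcast_arrays(np.asarray(theta, float), np.asarray(phi, float))
    P = assoc_legendre(L, np.cos(theta))
    feats = []
    for l in range(L + 1):
        for m in range(-l, l + 1):
            am = abs(m)
            norm = math.sqrt((2 * l + 1) / (4 * math.pi)
                             * math.factorial(l - am) / math.factorial(l + am))
            if m == 0:
                feats.append(norm * P[l, 0])
            elif m > 0:
                feats.append(math.sqrt(2) * norm * P[l, am] * np.cos(am * phi))
            else:
                feats.append(math.sqrt(2) * norm * P[l, am] * np.sin(am * phi))
    return np.stack(feats, axis=-1)

def shdd_encode(lat_deg, lon_deg, L):
    """Assumed encoding: coefficients of the degree-L truncated Dirac delta."""
    return shdd_basis(np.radians(90.0 - lat_deg), np.radians(lon_deg), L)

def shdd_decode(e, L, step_deg=2.0):
    """Stand-in decoder: argmax of F_e over a regular lat/lon grid."""
    theta = np.linspace(0.0, np.pi, int(round(180 / step_deg)) + 1)
    phi = np.linspace(-np.pi, np.pi, int(round(360 / step_deg)), endpoint=False)
    T, Ph = np.meshgrid(theta, phi, indexing="ij")
    F = shdd_basis(T, Ph, L) @ e       # value of the spherical function on the grid
    i, j = np.unravel_index(np.argmax(F), F.shape)
    return 90.0 - np.degrees(theta[i]), np.degrees(phi[j])

# Usage: encode a location that lies on the 2-degree grid, then decode it.
L = 15
e = shdd_encode(40.0, 10.0, L)
lat, lon = shdd_decode(e, L)
print(e.shape, round(float(lat), 6), round(float(lon), 6))
```

With this basis the encoding has $(L+1)^2$ dimensions, and the decoded location is only as precise as the search grid. In a diffusion setting the vector passed to the decoder would be a generated encoding rather than an exact one, which is why decoding is framed as finding the mode of a spherical function instead of inverting the encoder.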

Paper Structure

This paper contains 38 sections, 16 equations, 6 figures, 11 tables, 2 algorithms.

Figures (6)

  • Figure 1: Multi-scale latent diffusion for image geolocalization. Black solid/dotted arrows denote encoding/decoding steps. Orange modules are learnable, while blue modules are deterministic with no learnable parameters. (1) It is difficult to diffuse in the position encoding space [mai2023sphere2vec] because valid positional encodings are sparse, which makes diffusion model training and decoding difficult. (2) The location embedding space is dense and supports diffusion, but the non-linear mapping between the position encoding space and the location embedding space makes decoding back to the correct coordinates extremely difficult. Minimizing distances in the location embedding space may not minimize geographic distance. (3) The SHDD encoding space is dense: every point $\mathbf{e}$ in this encoding space corresponds to a spherical function $F_{\mathbf{e}}$, whose difference from the spherical Dirac delta function $\delta_{(\theta_0, \phi_0)}$ of the ground truth location $(\theta_0, \phi_0)$ is measured by the reverse KL-divergence $\mathcal{E}$. (4) The SHDD decoding addresses the non-linearity problem. The heatmaps (4a) and (4b) represent the distance from the spherical point represented by the embedding/encoding to the yellow star point in the middle. The distance measured by SHDD is significantly smoother.
  • Figure 2: (a): The architecture of the Conditional SirenNet Module (C-Siren). $x$ is the input latent vector, $x^{'}$ is the output latent vector, $t$ is the scalar timestep, and $e_I$ is the embedding of the input image. $d_i$ is the input dimension, $d_o$ is the output dimension, $d_T$ is the time embedding dimension, and $d_I$ is the conditional embedding dimension. (b): The architecture of the Conditional SirenNet-Based UNet (CS-UNet) and the workflow of LocDiff. $d$ is the latent dimension. The numbered circles denote the order of training steps.
  • Figure 3: Illustration of how the spherical probability mass concentration (mapped to a plane) corresponding to the SHDD encodings changes along the backward process at steps 0, 10, 20, 100, and 200, respectively. Brighter regions carry more probability mass.
  • Figure 4: Illustration of the spatial resolutions with $L=15$, $L=23$ and $L=31$. The bright regions are the probability mass concentrations, and points within these regions are similarly likely to be decoded as the location prediction. The smaller the bright regions, the lower the error introduced by SHDD decoding.
  • Figure 5: (a): An illustration of how the image geolocalization performance on the Im2GPS3K dataset increases as L increases. Different curves indicate performance metrics on different spatial scales. (b): A log-scale plot of the maximum absolute values of each SHDD encoding dimension up to 64$\times$64 = 4096 dimensions.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Definition 4.1: Coordinate Space
  • Definition 4.2: Position Encoding and Position Decoding