
Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis

Yancheng Zhang, Xiaohan Zhang, Guangyu Sun, Zonglin Lyu, Safwan Wshah, Chen Chen

Abstract

Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo$^\textbf{2}$, a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform two geo-spatial tasks: CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effectively reducing cross-view discrepancies for localization. This shared latent space naturally bridges cross-view image synthesis in both directions. To exploit this, we propose GeoFlow, a flow-matching model conditioned on geometry-aware latent embeddings. We further introduce a consistency loss to enforce latent alignment between the two synthesis directions, ensuring bidirectional coherence. Extensive experiments on standard benchmarks, including CVUSA, CVACT, and VIGOR, demonstrate that Geo$^\textbf{2}$ achieves state-of-the-art performance in both localization and synthesis, highlighting the effectiveness of 3D geometric priors for cross-view geo-spatial learning.
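The abstract frames CVGL as matching a ground query against aerial references inside the shared geometry-aware latent space produced by GeoMap. The training objective is not given in this section, so the sketch below is only a minimal illustration of how such shared-latent retrieval is commonly set up: a symmetric InfoNCE loss over paired ground/satellite embeddings and cosine-similarity ranking at inference. The encoder names, the loss choice, and the temperature value are assumptions, not the authors' exact design.

```python
# Hypothetical sketch of CVGL as retrieval in a shared latent space.
# Assumes GeoMap-style encoders have already produced pooled embeddings
# f^g (ground) and f^s (satellite); the InfoNCE setup is an assumption.
import torch
import torch.nn.functional as F

def infonce_loss(f_g, f_s, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired ground/satellite embeddings."""
    f_g = F.normalize(f_g, dim=-1)            # (B, D)
    f_s = F.normalize(f_s, dim=-1)            # (B, D)
    logits = f_g @ f_s.t() / temperature      # (B, B) cross-view similarities
    targets = torch.arange(f_g.size(0), device=f_g.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def localize(f_g, sat_gallery):
    """Rank satellite references by cosine similarity to one ground query."""
    f_g = F.normalize(f_g, dim=-1)             # (D,)
    gallery = F.normalize(sat_gallery, dim=-1) # (N, D)
    return torch.argsort(gallery @ f_g, descending=True)
```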

Paper Structure

This paper contains 24 sections, 17 equations, 17 figures, 14 tables, and 1 algorithm.

Figures (17)

  • Figure 1: Illustration of directly using VGGT on satellite (a) and ground (c) images, leading to the incorrect reconstructions shown in (b).
  • Figure 2: Overview of the Geo$^\textbf{2}$ framework. We first extract geometric features from ground and satellite images using VGGT. These dense features are then embedded into a shared geometry-aware latent space as detailed in Sec. \ref{sec:3map}. The resulting embeddings, $f^g$ and $f^s$, are used for both CVGL and CVIS. While Geo$^\textbf{2}$ supports bidirectional image synthesis, it only requires training in the ground-to-satellite (G2S) direction. As detailed in Sec. \ref{sec:3flow}, only ground images are needed as input during inference for G2S generation, and vice versa.
  • Figure 3: Illustration of VGGT reconstructions for (a) the ground view and (b) the satellite view, showing strong geometric alignment (e.g., buildings and overall layout). The ground view reconstruction is obtained from four perspective crops, illustrated in (c).
  • Figure 4: Overview of GeoMap pipeline. Ground and satellite images are individually processed via two separate branches.
  • Figure 5: Overview of our GeoFlow pipeline. The latent representation ($f^g$ for ground-to-satellite synthesis or $f^s$ for satellite-to-ground synthesis) is input as condition $C$ (see the sketch after this list).
  • ...and 12 more figures
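Figure 5 describes GeoFlow as a flow-matching model whose condition $C$ is the geometry-aware latent ($f^g$ for G2S, $f^s$ for S2G), and the abstract adds a consistency loss aligning the latents of the two synthesis directions. As a rough, hypothetical illustration of that setup (the paper's exact probability path, network interface, and loss weighting are not given in this section), a rectified-flow-style training term could look like:

```python
# Hypothetical sketch of a GeoFlow-style training step: flow matching with a
# linear interpolation path, conditioned on the geometry-aware latent C
# (f^g for G2S, f^s for S2G). Network interface and weighting are assumptions.
import torch
import torch.nn.functional as F

def flow_matching_step(v_net, x0, x1, cond):
    """One conditional flow-matching loss term.
    x0: noise sample, x1: target image (or its latent), cond: condition C."""
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1               # linear probability path
    v_target = x1 - x0                        # constant target velocity
    v_pred = v_net(x_t, t.flatten(), cond)    # velocity field conditioned on C
    return F.mse_loss(v_pred, v_target)

def consistency_loss(f_g, f_s):
    """Encourage the two synthesis directions to share an aligned latent."""
    return F.mse_loss(F.normalize(f_g, dim=-1), F.normalize(f_s, dim=-1))
```

A total objective would then combine the directional flow-matching terms with a weighted `consistency_loss(f_g, f_s)`; the specific weighting is an assumption here rather than the authors' reported configuration.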