Wrivinder: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery

Chandrakanth Gudavalli, Tajuddin Manhar Mohammed, Abhay Yadav, Ananth Vishnu Bhaskar, Hardik Prajapati, Cheng Peng, Rama Chellappa, Shivkumar Chandrasekaran, B. S. Manjunath

TL;DR

This paper introduces Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery, and MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments.

Abstract

Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth-based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task, which lacks suitable benchmarks, we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivinder and MC-Sat provide a first comprehensive baseline and testbed for studying geometry-centered cross-view alignment without paired supervision. In zero-shot experiments, Wrivinder achieves sub-30 m geolocation accuracy across both dense and large-area scenes, highlighting the promise of geometry-based aggregation for robust ground-to-satellite localization.

Paper Structure (24 sections, 12 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Overview of the satellite-to-ground image alignment pipeline. Directly aligning ground images to satellite views is impractical due to large viewpoint and scale differences. Wrivinder aggregates information from multiple ground images to reconstruct a 3D scene, generates a zenith-view rendering, and aligns it to the satellite image using the estimated metric dimensions in meters.
  • Figure 2: MC-Sat Dataset Overview, showing scenes from the ULTRAA, VisymScenes, and JHU-Ames datasets. The central image in each tile is a satellite view of the scene, surrounded by corresponding ground images illustrating the diversity of viewpoints and environments.
  • Figure 3: Overview of Wrivinder, a zero-shot, training-free pipeline for geo-locating ground images on a geo-registered satellite map. Given an unordered set of ground images, the pipeline reconstructs a sparse 3D scene via SfM and densifies it using 3D Gaussian Splatting. The Zenith Viewpoint Extractor estimates the vertical direction and generates a top-down zenith render. The Metric Mapper uses monocular depth priors to recover approximate metric scale and determine the physical footprint of the zenith view. A test-time Deep Template Matcher (DTM) aligns this render to the satellite image, and the resulting correspondences are back-projected through the 3DGS and SfM models via the Gaussian Splat Geolocator to estimate GPS positions for all ground cameras.
  • Figure 4: Key intermediate outputs of Wrivinder, showing semantic maps, the SfM point cloud, semantified reconstruction, metric depth maps, and the resulting metric-scaled zenith render.
  • Figure 5: Satellite–render pairs for several MC-Sat scenes. In each case, the left image shows the satellite view (with blue dots indicating the ground-truth camera locations) and the right image shows the corresponding 3DGS zenith rendering produced by Wrivinder. Gaps, blurring, and missing structures in the reconstruction often make alignment more ambiguous and are a primary source of the higher errors observed in some scenes.
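
The Figure 3 caption says the Metric Mapper uses monocular depth priors to recover an approximate metric scale and the physical footprint of the zenith view, but this summary does not give the procedure. The sketch below shows one common way such a scale could be estimated, not the authors' method: take the median ratio between metric monocular depths and the corresponding up-to-scale SfM depths, then multiply the render extent by that ratio. The function names, the median-ratio rule, and all numeric values are hypothetical and chosen only for illustration.

```python
from statistics import median


def estimate_metric_scale(sfm_depths, mono_depths_m):
    """Return meters per SfM unit as the median of per-pixel depth ratios.

    Assumption: both lists hold depths of the same pixels, one from the
    up-to-scale SfM model and one from a metric monocular depth estimator.
    The median is used so that a few bad depth estimates do not dominate.
    """
    ratios = [m / s for s, m in zip(sfm_depths, mono_depths_m) if s > 0 and m > 0]
    if not ratios:
        raise ValueError("no valid depth pairs")
    return median(ratios)


def zenith_footprint_m(extent_sfm, scale):
    """Convert the (width, height) of the zenith render from SfM units to meters."""
    width, height = extent_sfm
    return (width * scale, height * scale)


# Hypothetical values: three consistent depth pairs and one outlier.
sfm_depths = [2.0, 4.0, 5.0, 10.0]
mono_depths_m = [5.0, 10.0, 12.5, 100.0]

scale = estimate_metric_scale(sfm_depths, mono_depths_m)
footprint = zenith_footprint_m((40.0, 30.0), scale)
print(scale, footprint)
```

With a footprint in meters and a satellite tile of known ground resolution, the zenith render can be resized to the tile's meters-per-pixel before template matching, which is presumably why the pipeline recovers metric scale first.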