Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz

Abstract

Cross-view geo-localization (CVGL) estimates a camera's location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally formulate CVGL as an image-retrieval problem in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery that reflects real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline. These results demonstrate the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
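
To make the formulation concrete, here is a minimal sketch of the greedy coarse-to-fine inference loop described above. It assumes each zoom step splits the current tile into a 2x2 grid; the branching factor, the tile-pyramid layout, and the `model.next_action` interface are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the zoom-in inference loop (assumptions noted above).
def localize(model, street_view, map_tiles, num_steps):
    """Greedy coarse-to-fine localization over a satellite tile pyramid.

    map_tiles[t][(row, col)] is the tile at zoom level t (hypothetical layout).
    Returns the (row, col) index of the terminal tile at level num_steps.
    """
    row, col = 0, 0                      # start from the single coarse tile M_0
    history = []                         # previously chosen zoom actions
    for t in range(num_steps):
        tile = map_tiles[t][(row, col)]
        probs = model.next_action(street_view, tile, history)  # distribution over 4 sub-tiles
        action = max(range(4), key=lambda a: probs[a])          # greedy zoom-in choice
        history.append(action)
        # Map the chosen quadrant onto the child tile index at level t + 1.
        row = 2 * row + action // 2
        col = 2 * col + action % 2
    return row, col                      # center of this tile = location estimate
```

A natural training signal, consistent with the abstract though not confirmed by it, would be plain cross-entropy on the ground-truth zoom action at each step, which would avoid contrastive losses and hard negative mining entirely.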

Paper Structure

This paper contains 17 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Two paradigms for cross-view geo-localization. (a) Retrieval-based approaches maintain a large GPS-tagged reference database and use a contrastively trained encoder to perform nearest-neighbor search at query time. (b) Autoregressive geo-localization (ours) replaces global retrieval with sequential, coarse-to-fine zoom-in decisions over multi-scale satellite imagery, dramatically reducing dependence on exhaustive database search.
  • Figure 2: Coverage mismatch. In the street-view image (a), the stadium and its distinctive gate are clearly visible. In the satellite crop (b) from the same region, this landmark falls outside the patch, illustrating how small tiles can miss critical cues present in the street view. Images sourced from Google Street View and Google Maps [google_maps_static_api, google_streetview_static_api].
  • Figure 3: Dataset examples. Street-view samples from our cross-view image localization corpus illustrating the two defining characteristics: (i) limited FoV typical of first-person and dash-cam capture; and (ii) broad diversity across time of day, seasonal appearance, weather, scene type, and capture platforms. This variability, together with viewpoint changes, occlusions, and motion blur, widens the appearance gap to overhead imagery and makes cross-view matching more challenging.
  • Figure 4: Overview of Just Zoom In. (a) Image encoding. A shared, off-the-shelf vision encoder maps the street-view image $I_g$ and the multi-scale satellite maps $\{M_t\}$ to global representation tokens $e_g$ and $e_t$, one token per image. (b) Autoregressive action modeling. The image tokens, interleaved with previously chosen action tokens, are fed to a causal transformer that predicts a distribution over the next zoom-in action $a_t$ at each step. (c) Sample zoom-in sequence. Beginning from $M_0$, the model selects the most probable patch, zooms in to obtain $M_{t+1}$, and iterates until reaching a terminal patch $M_N$; the center of $M_N$ is taken as the location estimate. A minimal sketch of this pipeline is given after the figure list.
  • Figure 5: Focus at different zoom levels. We visualize similarity maps between street-view patch tokens and the satellite [CLS] embedding (blue = low similarity, red = high similarity) across three zoom levels. At the coarsest level, high similarity concentrates on far-away regions, whereas at finer zoom levels it shifts toward nearby objects and road/building details, indicating that the model learns to exploit different scale-specific visual cues. A short sketch of this computation is also given after the figure list.
  • ...and 1 more figure
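
As referenced in the Figure 4 caption, the sketch below renders the autoregressive action model in PyTorch: one street-view token, per-level map tokens, and embeddings of past actions are interleaved into a sequence and processed by a causal transformer that outputs next-action logits. The class name `ZoomPolicy`, the hidden size, the 4-way action space, and the layer counts are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ZoomPolicy(nn.Module):
    """Causal transformer over interleaved image and action tokens."""

    def __init__(self, dim=768, num_actions=4, num_layers=4, num_heads=8):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, dim)      # embeds past zoom choices
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, num_actions)               # next-action logits

    def forward(self, e_g, e_maps, actions):
        # e_g: (B, D) street-view token; e_maps: (B, T, D) map tokens for
        # zoom levels 0..T-1; actions: (B, T-1) previously chosen actions.
        B, T, D = e_maps.shape
        tokens = [e_g.unsqueeze(1)]                           # start with the street-view token
        for t in range(T):
            tokens.append(e_maps[:, t : t + 1])               # map token e_t
            if t < T - 1:                                     # interleave past action tokens
                tokens.append(self.action_emb(actions[:, t : t + 1]))
        x = torch.cat(tokens, dim=1)                          # (B, 2T, D)
        L = x.size(1)
        causal = torch.triu(                                  # standard causal attention mask
            torch.full((L, L), float("-inf"), device=x.device), diagonal=1
        )
        x = self.transformer(x, mask=causal)
        return self.head(x[:, -1])                            # logits for next action a_T
```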
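
The Figure 5 similarity maps can be reproduced, under assumed tensor shapes, by the short computation below: cosine similarity between each street-view patch token and the satellite [CLS] embedding, laid out on the patch grid. The 16x16 grid and the 768-dimensional embeddings are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def similarity_map(patch_tokens, sat_cls, grid=(16, 16)):
    """patch_tokens: (N, D) street-view patch embeddings; sat_cls: (D,)
    satellite [CLS] embedding. Returns an (H, W) similarity map in [-1, 1]."""
    sims = F.cosine_similarity(patch_tokens, sat_cls.unsqueeze(0), dim=-1)  # (N,)
    return sims.reshape(grid)   # lay out on the patch grid for display

# Example with random tensors standing in for real encoder outputs.
heat = similarity_map(torch.randn(256, 768), torch.randn(768))
```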