Table of Contents
Fetching ...

From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance

Jeongho Min, Dongyoung Kim, Jaehyup Lee

TL;DR

This paper tackles cross-view street-to-satellite retrieval without supervised training, enabling retrieval from monocular street images by inferring location semantics with a large language model and then geocoding to obtain satellite queries. It combines a frozen pretrained vision encoder (e.g., DINOv2) with a PCA-whitening refinement to bridge the ground–satellite domain gap, achieving state-of-the-art zero-shot results on University-1652. The approach also supports automatic generation of semantically aligned Street-to-Satellite datasets, offering scalable data construction without manual annotation. Overall, the method demonstrates that training-free, language-guided localization and robust feature refinement can rival supervised Cross-view retrieval while enabling practical deployment and scalable data generation.

Abstract

Cross-view image retrieval, particularly street-to-satellite matching, is a critical task for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. However, existing approaches often require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment. In this paper, we present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM), requiring no additional training. Given a monocular street-view image, our method extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via geocoding API, and retrieves matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-based whitening feature refinement. Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings. Moreover, our pipeline enables automatic construction of semantically aligned street-to-satellite datasets, which is offering a scalable and cost-efficient alternative to manual annotation. All source codes will be made publicly available at https://jeonghomin.github.io/street2orbit.github.io/.

From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance

TL;DR

This paper tackles cross-view street-to-satellite retrieval without supervised training, enabling retrieval from monocular street images by inferring location semantics with a large language model and then geocoding to obtain satellite queries. It combines a frozen pretrained vision encoder (e.g., DINOv2) with a PCA-whitening refinement to bridge the ground–satellite domain gap, achieving state-of-the-art zero-shot results on University-1652. The approach also supports automatic generation of semantically aligned Street-to-Satellite datasets, offering scalable data construction without manual annotation. Overall, the method demonstrates that training-free, language-guided localization and robust feature refinement can rival supervised Cross-view retrieval while enabling practical deployment and scalable data generation.

Abstract

Cross-view image retrieval, particularly street-to-satellite matching, is a critical task for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. However, existing approaches often require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment. In this paper, we present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM), requiring no additional training. Given a monocular street-view image, our method extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via geocoding API, and retrieves matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-based whitening feature refinement. Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings. Moreover, our pipeline enables automatic construction of semantically aligned street-to-satellite datasets, which is offering a scalable and cost-efficient alternative to manual annotation. All source codes will be made publicly available at https://jeonghomin.github.io/street2orbit.github.io/.

Paper Structure

This paper contains 18 sections, 4 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison between conventional contrastive learning-based cross-view retrieval (top) and our proposed training-free, LLM-guided framework (bottom). Unlike prior work requiring separate encoders and contrastive training, our approach uses a single pretrained vision encoder (e.g., DINOv2) for both views, enabling direct retrieval in a shared feature space.
  • Figure 2: Overview of our proposed training-free cross-view retrieval pipeline. Given a street-level image, we first extract textual context using Google Image Search, then infer the most specific geolocatable name via a large language model(LLM). This name is geocoded into coordinates to retrieve a satellite tile, which is embedded using a pretrained vision encoder (e.g., DINOv2). The resulting feature is whitened and matched against a satellite gallery using similarity-based retrieval. Our method enables semantically aligned cross-view matching without any supervised training.
  • Figure 3: Overview of our dataset generation pipeline. We prompt an LLM to produce a list of globally recognizable locations (e.g., "Marina Bay Sands", "Tokyo Metropolitan Government Building"). For each location, we retrieve street-view images via Google Image Search and obtain geographic coordinates using a Google Geocoding API. These coordinates are used to generate corresponding satellite tiles through the Google Static Maps API. This process produces high-quality, semantically aligned street-to-satellite image pairs at scale.
  • Figure 4: Qualitative results of our training-free retrieval framework. Given a street-level query image (left), our method retrieves the most semantically aligned satellite tile (right) using a pipeline combining Google Image Search, an LLM-based geolocation inference, geocoding, and DINOv2 with PCA-whitening. The red bounding boxes in each satellite image indicate the correctly retrieved region corresponding to the inferred location. The top row shows accurate retrieval of a domed building with distinctive geometry; the second row demonstrates robustness in a greenery scene with minimal context, such as a house; the third row highlights generalization to an urban street with ambiguous visual representations. These samples illustrate our proposed framework's ability to perform reliable cross-view matching across diverse environments, without any kind of supervised training.