Where am I? Cross-View Geo-localization with Natural Language Descriptions

Junyan Ye, Honglin Lin, Leyan Ou, Dairong Chen, Zihao Wang, Qi Zhu, Conghui He, Weijia Li

TL;DR

This work formalizes cross-view geo-localization driven by natural language descriptions and introduces the CVG-Text dataset, which enables text-guided retrieval of satellite or OSM images. It presents CrossText2Loc, a long-text-friendly retrieval model that uses Extended Embedding and a contrastive learning objective, together with an Explainable Retrieval Module that provides natural-language justifications and confidence scores. The dataset is generated with a progressive GPT-4o-based pipeline enhanced by OCR and open-world segmentation, and the method shows recall gains over baselines across multiple cities. Together, these contributions offer a scalable, interpretable framework for text-based geo-localization, with practical implications for navigation and emergency response.

Abstract

Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OpenStreetMap (OSM) data. However, most existing studies focus on image-to-image retrieval, and fewer address text-guided retrieval, a task vital for applications such as pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM data from scene text descriptions. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization details. Additionally, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. For explainability, it provides retrieval reasons in addition to similarity scores. More information can be found at https://yejy53.github.io/CVG-Text/
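
The recall improvement quoted above is a retrieval metric. As a rough illustration, assuming the standard Recall@K definition (the fraction of text queries whose ground-truth reference is ranked within the top K), the metric can be computed from a query-by-reference similarity matrix as sketched below. The function name `recall_at_k` and the toy similarity values are hypothetical and are not taken from the paper.

```python
import numpy as np

def recall_at_k(similarity, gt_index, ks=(1, 5, 10)):
    """Recall@K for text-to-image retrieval (illustrative sketch).

    similarity: (num_queries, num_references) score matrix, higher is better.
    gt_index:   ground-truth reference index for each text query.
    """
    similarity = np.asarray(similarity, dtype=float)
    gt_index = np.asarray(gt_index)
    # Score of the ground-truth reference for each query.
    gt_scores = similarity[np.arange(len(gt_index)), gt_index]
    # Zero-based rank: number of references scoring strictly higher.
    ranks = (similarity > gt_scores[:, None]).sum(axis=1)
    return {k: float((ranks < k).mean()) for k in ks}

# Toy example: 3 text queries, 3 candidate satellite/OSM tiles.
toy_similarity = [
    [0.9, 0.1, 0.0],  # ground truth (index 0) ranked first
    [0.2, 0.3, 0.8],  # ground truth (index 1) ranked second
    [0.5, 0.4, 0.1],  # ground truth (index 2) ranked third
]
toy_recall = recall_at_k(toy_similarity, [0, 1, 2], ks=(1, 2, 3))
```

In this toy case the ground-truth ranks are first, second, and third, so the recall rises from one third at K=1 to 1.0 at K=3.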

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Towards Text-guided Geo-localization. In scenarios where GPS signals are interfered with, users must describe their surroundings in natural language, providing various location cues to determine their position (top). To address this, we introduce a text-based cross-view geo-localization task, which retrieves satellite imagery or OSM data based only on text queries for position localization (bottom).
  • Figure 2: Textual Feature Statistics Overview. (a) t-SNE visualization of text data from different cities; (b) text similarity matrix; (c) token length distribution histogram; (d) comparison of text statistics across different datasets.
  • Figure 3: Overall Process for Street-View Text Description Generation using GPT-4o.
  • Figure 4: The proposed CrossText2Loc method. Street-view texts serve as query inputs, with satellite and OSM images as references.
  • Figure 5: Qualitative retrieval results on the CVG-Text dataset. The left side of the figure displays the original street-view data, synthetic text data with corresponding response heatmaps, and the retrieval reasons provided by our Explainable Retrieval Module (ERM). The right side shows the top three retrieval results with corresponding response heatmaps; green indicates correct matches and red denotes incorrect results.
  • ...and 1 more figure
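
Figure 4 pairs street-view text queries with satellite or OSM reference images, and the TL;DR mentions a contrastive learning objective. The sketch below shows one common form of such an objective, a symmetric InfoNCE loss of the kind used by CLIP-style models. Whether CrossText2Loc uses exactly this form is an assumption, as are the function names, the temperature of 0.07, and the toy embeddings.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (text, image) pairs.

    Row i of text_emb is assumed to describe the location shown in row i of
    image_emb; every other row in the batch acts as a negative.
    """
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    v = l2_normalize(np.asarray(image_emb, dtype=float))
    logits = t @ v.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(t))
    loss_t2i = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> image
    loss_i2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # image -> text
    return float(0.5 * (loss_t2i + loss_i2t))

# Toy embeddings (hypothetical): 4 perfectly matched pairs vs. shuffled pairs.
toy_text = np.eye(4)
aligned_loss = symmetric_contrastive_loss(toy_text, np.eye(4))
shuffled_loss = symmetric_contrastive_loss(toy_text, np.eye(4)[[1, 2, 3, 0]])
```

The loss is small when each text embedding is closest to its own reference image and grows when the pairs are mismatched, which is the behavior a retrieval model trained this way is meant to exploit.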