NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization

Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber

TL;DR

This work tackles image geo-localization by enhancing reasoning with language and external knowledge. It introduces NaviClues, a GeoGuessr-derived reasoning dataset, and Navig, a Reasoner–Searcher–Guesser framework that grounds image details with tools like maps and guidebooks. Navig, trained on NaviClues with LoRA, delivers state-of-the-art accuracy on open benchmarks while using fewer than 1000 training samples, and provides interpretable reasoning traces. The approach demonstrates that strategically integrated reasoning and external knowledge improve geo-localization and offer a path toward more transparent, data-efficient spatial understanding with vision-language models.

Abstract

Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) have the best accuracy at this task, there is a dearth of high-quality datasets and models for analytical reasoning. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning from language. Using this dataset, we present Navig, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, Navig reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1000 training samples. Our dataset and code are available at https://github.com/SparrowZheyuan18/Navig/.
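To make the headline metric concrete, the following is a minimal sketch of how an average distance error and a relative reduction could be computed. It assumes the error is the great-circle (haversine) distance in kilometres between predicted and true coordinates; the abstract does not state the exact formula, and the function names and Earth-radius constant here are illustrative choices, not taken from the Navig code.

```python
import math

# Mean Earth radius in km (an assumed constant, not taken from the paper).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against floating-point values marginally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def mean_distance_error_km(predictions, targets):
    """Average haversine error over paired (lat, lon) predictions and targets."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equally long")
    total = sum(haversine_km(*p, *t) for p, t in zip(predictions, targets))
    return total / len(predictions)


def relative_reduction(baseline_error, new_error):
    """Fractional reduction of new_error relative to baseline_error."""
    return (baseline_error - new_error) / baseline_error
```

Under this reading, a 14% reduction means `relative_reduction(baseline, navig)` is about 0.14, for example a hypothetical baseline mean error of 100 km falling to 86 km.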

Paper Structure

This paper contains 26 sections, 6 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: In image geo-localization, models need to find both cultural and geographical clues to infer correct locations. External tools like maps and guidebooks can also be helpful by providing extra knowledge.
  • Figure 2: Top clues in human reasoning. Humans identify roads, cars, poles, and linguistic clues, specifically the languages on plates, signs, and houses.
  • Figure 3: The framework of Navig comprises three main components: the Reasoner, which handles general reasoning; the Searcher, which leverages external knowledge for detail-specific analysis; and the Guesser, which combines outputs from both analyzers to generate predictions.
  • Figure 4: Top: The model uses visual details and OpenStreetMap to accurately determine the location. Middle: The model is misled by a linguistic element, the shop name, resulting in an incorrect inference. Bottom: The model finds a namesake when using OpenStreetMap.
  • Figure 5: Location distribution of NaviClues, covering a wide range of countries around the world.
  • ...and 1 more figures