Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales

Zhaofang Qian, Hardy Chen, Zeyu Wang, Li Zhang, Zijun Wang, Xiaoke Huang, Hui Liu, Xianfeng Tang, Zeyu Zheng, Haoqin Tu, Cihang Xie, Yuyin Zhou

TL;DR

EarthWhere is a two-scale vision–language geolocation benchmark that probes both final localization and reasoning traces. It couples WhereCountry (multiple-choice question answering on panoramas) with WhereStreet (fine-grained localization with evidence and optional web search) and introduces a Thinking Score based on Shapley-weighted clue attribution. The results show that closed-source models dominate, web retrieval yields inconsistent gains, and regional biases persist, underscoring the need for bias-aware, evidence-grounded localization. The work provides open-source data and protocols to enable fair, reproducible evaluation and progress in VLM geolocation.
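One possible reading of "Shapley-weighted clue attribution" can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the `value` function (how useful a subset of clues is), the clue names, and the normalization in `thinking_score` are all assumptions introduced here. Exact Shapley values are computed by enumerating every ordering of the clues, which is only practical for a handful of clues per image.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Iterable


def shapley_values(
    clues: Iterable[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley value of each clue under a hypothetical value function.

    `value(S)` is assumed to return how well a location can be inferred
    from the clue subset S (for example, accuracy when only S is given).
    """
    clues = list(clues)
    totals = {c: 0.0 for c in clues}
    n_orders = 0
    for order in permutations(clues):
        seen: FrozenSet[str] = frozenset()
        for c in order:
            with_c = seen | {c}
            # Marginal contribution of c when added after the clues in `seen`.
            totals[c] += value(with_c) - value(seen)
            seen = with_c
        n_orders += 1
    return {c: t / n_orders for c, t in totals.items()}


def thinking_score(weights: Dict[str, float], matched: Iterable[str]) -> float:
    """Fraction of total (non-negative) clue weight covered by matched clues.

    `matched` is the set of key clues found in a model's reasoning trace.
    Clipping negative weights to zero is an assumption of this sketch.
    """
    total = sum(max(w, 0.0) for w in weights.values())
    if total == 0.0:
        return 0.0
    covered = sum(max(weights[c], 0.0) for c in set(matched) if c in weights)
    return covered / total


# Toy, hypothetical value function: each clue adds a fixed amount of evidence.
GAINS = {"road_sign": 0.6, "license_plate": 0.3, "vegetation": 0.1}


def toy_value(subset: FrozenSet[str]) -> float:
    return sum(GAINS[c] for c in subset)


weights = shapley_values(GAINS, toy_value)
score = thinking_score(weights, ["road_sign", "vegetation"])
```

With an additive value function like `toy_value`, each clue's Shapley value equals its own gain, so a reasoning trace that mentions the road sign and the vegetation but misses the license plate covers 0.7 of the total weight. With interacting clues, the weights would instead reflect each clue's average marginal contribution.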

Abstract

Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions, a task that is both challenging and in demand in practice, has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use. EarthWhere comprises 810 globally distributed images across two complementary geolocation scales: WhereCountry (500 multiple-choice question-answering tasks on panoramas with country-level answers) and WhereStreet (310 fine-grained street-level identification tasks requiring multi-step reasoning with optional web search). For evaluation, we adopt final-prediction metrics: location accuracy within k km (Acc@k) for coordinates and hierarchical path scores for textual localization. Beyond this, we propose to explicitly score intermediate reasoning chains using human-verified key visual clues and a Shapley-reweighted thinking score that attributes credit to each clue's marginal contribution. We benchmark 13 state-of-the-art VLMs with web search tools on EarthWhere and report different types of final-answer accuracy as well as the calibrated model thinking scores. Overall, Gemini-2.5-Pro achieves the best average accuracy at 56.32%, while the strongest open-weight model, GLM-4.5V, reaches 34.71%. We reveal that web search and reasoning do not guarantee improved performance when visual clues are limited, and that models exhibit regional biases, achieving up to 42.7% higher scores in certain areas than in others. These findings highlight not only the promise of these models but also the persistent challenges they face in mitigating bias and achieving robust, fine-grained localization. We open-source our benchmark at https://github.com/UCSC-VLAA/EarthWhere.
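The Acc@k metric mentioned above can be sketched as the fraction of predictions that fall within k km of the ground-truth coordinates. The abstract does not say how distance is measured or which thresholds are used, so the haversine great-circle distance, the mean Earth radius, and the function names below are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def acc_at_k(preds: Sequence[LatLon], golds: Sequence[LatLon], k_km: float) -> float:
    """Share of predictions within k_km of their ground-truth location."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    hits = sum(haversine_km(*p, *g) <= k_km for p, g in zip(preds, golds))
    return hits / len(preds)


# Hypothetical example: one prediction about 56 km off, one about 1112 km off.
example_preds = [(0.0, 0.0), (0.0, 0.0)]
example_golds = [(0.0, 0.5), (0.0, 10.0)]
example_acc = acc_at_k(example_preds, example_golds, k_km=100.0)
```

Reporting the metric at several thresholds (for instance, a small k for street-level precision and a larger k for city-level precision) shows how quickly accuracy degrades as the tolerance tightens; the specific thresholds would be taken from the benchmark's protocol.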

Paper Structure

This paper contains 28 sections, 5 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Illustration of a complete search and reasoning process for an EarthWhere sample.
  • Figure 2: All locations in EarthWhere shown on a global map.
  • Figure 4: Overall performance combining both the WhereCountry and WhereStreet results.
  • Figure 5: Main results on WhereCountry ranked by accuracy. Closed-source models lead by a large margin. Neither web search nor deeper reasoning consistently improves performance.
  • Figure 6: Effect of number of human-annotated key clues as extra context.
  • ...and 1 more figure