Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning

Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Xiaohong Liu, Xiongkuo Min

TL;DR

This paper introduces WanderBench, the first open-access global geolocation benchmark designed for actionable geolocation reasoning in embodied scenarios, and GeoAoT, a framework that couples reasoning with embodied actions. Together they define a new paradigm for actionable, reasoning-driven geolocation in embodied visual understanding.

Abstract

Geolocation, the task of identifying the geographic location of an image, requires abundant world knowledge and complex reasoning abilities. Although advanced large multimodal models (LMMs) have demonstrated both capabilities, their performance on the geolocation task remains unexplored. To this end, we introduce WanderBench, the first open-access global geolocation benchmark designed for actionable geolocation reasoning in embodied scenarios. WanderBench contains over 32K panoramas across six continents, organized as navigable graphs that enable physical actions such as rotation and movement, transforming geolocation from static recognition into interactive exploration. Building on this foundation, we propose GeoAoT, a geolocation framework with Action of Thought (AoT) that couples reasoning with embodied actions. Instead of generating textual reasoning chains, GeoAoT produces actionable plans, such as approaching landmarks or adjusting viewpoints, to actively reduce uncertainty. We further establish an evaluation protocol that jointly measures geolocation accuracy and difficulty-aware geolocation questioning ability. Experiments on 19 large multimodal models show that GeoAoT achieves superior fine-grained localization and stronger generalization in dynamic environments. WanderBench and GeoAoT define a new paradigm for actionable, reasoning-driven geolocation in embodied visual understanding.
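
To make the idea of panoramas "organized as navigable graphs" with rotation and movement actions concrete, the following is a minimal sketch, not the benchmark's actual data format. The names `PanoNode`, `AgentState`, `rotate`, and `move_forward`, the heading-keyed neighbor map, and the 45-degree tolerance are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PanoNode:
    """One panorama in a navigable graph (hypothetical schema)."""
    pano_id: str
    lat: float
    lon: float
    # Maps a compass heading in degrees to the neighbouring panorama id.
    neighbors: Dict[int, str] = field(default_factory=dict)


@dataclass
class AgentState:
    pano_id: str
    heading: int = 0  # degrees clockwise from north, in [0, 360)


def angular_gap(a: int, b: int) -> int:
    """Smallest absolute difference between two headings, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def rotate(state: AgentState, delta_deg: int) -> AgentState:
    """Rotation action: change the viewing direction, stay at the same node."""
    return AgentState(state.pano_id, (state.heading + delta_deg) % 360)


def move_forward(state: AgentState, graph: Dict[str, PanoNode],
                 tolerance_deg: int = 45) -> AgentState:
    """Movement action: step to the neighbour best aligned with the heading.

    If no edge lies within the tolerance, the agent stays where it is.
    """
    node = graph[state.pano_id]
    if not node.neighbors:
        return state
    best = min(node.neighbors, key=lambda h: angular_gap(h, state.heading))
    if angular_gap(best, state.heading) > tolerance_deg:
        return state
    return AgentState(node.neighbors[best], state.heading)


# Two panoramas connected east-west (made-up coordinates).
graph = {
    "A": PanoNode("A", 48.8584, 2.2945, {90: "B"}),
    "B": PanoNode("B", 48.8584, 2.2950, {270: "A"}),
}
start = AgentState("A", heading=0)
blocked = move_forward(start, graph)           # facing north: no edge nearby
arrived = move_forward(rotate(start, 90), graph)  # face east, then step to B
```

Keeping the state immutable (each action returns a new `AgentState`) makes it easy to log or replay an exploration trajectory, which is one plausible design choice among several.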

Paper Structure

This paper contains 19 sections, 5 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Visual geolocation uses visual clues to determine the location of an image. WanderBench evaluates both the model’s ability to propose locations under different difficulty levels and its geolocation accuracy across diverse scenes. Action of Thought allows the model to actively explore nearby views and gather more information for a more accurate result. GeoAoT further improves overall geolocation performance, shown by the thicker line in the radar chart, alongside location-proposing performance. All radar charts are min-max normalized.
  • Figure 2: (a) Global distribution of model-proposed locations across all continents. (b) Country-level distribution of the proposed locations. (c) Average navigation-graph structure of the WanderBench dataset.
  • Figure 3: Visualization of the average navigation graph structure across all locations in the WanderBench dataset. Node sizes correspond to occurrence frequency within spatial bins, while edge thickness indicates transition frequency between bins.
  • Figure 4: An overview of GeoAoT. Given an input image, GeoAoT first leverages a pre-trained LMM to generate an initial geo-guess. It then iteratively refines this estimate through AoT-based multi-turn interactions, where the model reasons about uncertainty and issues actions to gather more evidence, operating purely at inference time without any additional training.
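
The Figure 4 description (initial geo-guess, then multi-turn refinement through actions, all at inference time) can be sketched as a simple loop. This is an illustrative outline under stated assumptions, not the authors' implementation: `refine_guess`, the `observe`/`propose`/`act` callables, the action string, the turn limit, and the toy stand-ins for the model and environment are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Guess = Tuple[float, float]   # (latitude, longitude) in degrees
Action = str                  # e.g. "move_forward" (hypothetical action name)


def refine_guess(
    observe: Callable[[], str],
    propose: Callable[[List[str]], Tuple[Guess, Optional[Action]]],
    act: Callable[[Action], None],
    max_turns: int = 4,
) -> Tuple[Guess, List[Action]]:
    """Iteratively refine a location guess by acting to gather more views.

    `propose` stands in for the LMM: given all views so far, it returns a
    guess plus the next action, or None when it is confident enough to stop.
    """
    views = [observe()]
    guess, action = propose(views)      # initial geo-guess from the input view
    taken: List[Action] = []
    for _ in range(max_turns):
        if action is None:              # no further evidence requested
            break
        act(action)                     # rotate or move in the environment
        taken.append(action)
        views.append(observe())
        guess, action = propose(views)  # re-estimate with the new evidence
    return guess, taken


class ToyEnv:
    """Stand-in environment that just counts steps (assumption)."""

    def __init__(self) -> None:
        self.step = 0

    def observe(self) -> str:
        return f"view_{self.step}"

    def act(self, action: Action) -> None:
        self.step += 1


def toy_propose(views: List[str]) -> Tuple[Guess, Optional[Action]]:
    """Stand-in model: asks for more views until it has seen three."""
    if len(views) < 3:
        return (0.0, 0.0), "move_forward"
    return (48.8584, 2.2945), None


env = ToyEnv()
final_guess, actions = refine_guess(env.observe, toy_propose, env.act)
```

No weights are updated anywhere in the loop, which mirrors the caption's point that refinement happens purely at inference time; the `max_turns` cap is an added safeguard so that a model which never stops asking for actions still terminates.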