GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models

Yushuo Zheng, Jiangyong Ying, Huiyu Duan, Chunyi Li, Zicheng Zhang, Jing Liu, Xiaohong Liu, Guangtao Zhai

TL;DR

GeoX-Bench addresses the challenge of cross-view geo-localization and geo-pose estimation for large multimodal models. It introduces a large-scale, cross-view dataset with ground-to-satellite panoramic pairs and a vast QA corpus across 128 cities, defining seven tasks that jointly test localization and orientation reasoning. Evaluating 25 state-of-the-art LMMs and instruction-tuned variants, the study finds that geo-localization is easier than pose estimation, and that instruction tuning yields substantial gains though pose estimation remains challenging. The benchmark provides a critical platform for advancing geometric reasoning in embodied AI and autonomous navigation, and offers baseline analyses on model biases, scaling effects, and the impact of task-specific training.

Abstract

Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks; however, their knowledge and abilities in the cross-view geo-localization and pose estimation domains remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, etc. To bridge this gap, we introduce GeoX-Bench, a comprehensive benchmark designed to explore and evaluate the capabilities of LMMs in cross-view geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with 755,976 corresponding question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remainder are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate the capabilities of 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the gains obtained from instruction tuning. Our benchmark demonstrates that while current LMMs achieve impressive performance on geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement. Instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve their cross-view geo-sense abilities. GeoX-Bench is available at https://github.com/IntMeGroup/GeoX-Bench.

Paper Structure

This paper contains 28 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Geographic composition of the GeoX-Bench dataset by land cover type. The benchmark is weighted toward developed regions, with Urban/Built-up (36.33%) and Residential/Commercial (27.75%) areas constituting the majority. Natural and rural environments, including Forests (14.26%), Agriculture (12.60%), and Rangeland/Barren (9.05%), provide geospatially representative and diverse settings for evaluation.
  • Figure 2: Illustration of the GeoX-Bench benchmark tasks. The tasks include heading estimation with known or unknown camera locations, location verification on a satellite map, location selection, and map selection from candidates. These tasks evaluate models' abilities to reason over ground-to-satellite image pairs for localization and pose understanding.
  • Figure 3: The GeoX-Bench data curation pipeline, from source sampling to final quality control. We first sample ground-satellite pairs from four existing datasets to ensure broad geographic coverage. In the pre-processing stage, ground-level panoramas are programmatically rotated to align to a consistent North orientation before cardinal views are extracted, while corresponding satellite imagery is stitched and cropped. An iterative, LLM-assisted framework with human oversight is used for question prompt generation. A final quality control stage removes data with visual artifacts or cross-view inconsistencies, such as spatial or temporal mismatches, to ensure benchmark integrity.
  • Figure 4: Model scale mitigates choice bias across different tasks and model families. (a, b) The min-max normalized radar charts show that smaller models exhibit a strong preference for a single option on both the Cross-Map Retrieval and Fixed-Pose Heading Estimation tasks. This bias diminishes as model size increases, resulting in more uniform choice distributions. (c) This trend generalizes across four model families: normalized entropy ($H_{\text{norm}}$) consistently increases with the number of parameters. This indicates that larger models are better calibrated and less prone to relying on simplistic choice priors.
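
The choice-bias measure in Figure 4(c) can be illustrated with a minimal sketch. The exact definition of $H_{\text{norm}}$ is not given in this section, so the code assumes the common form: Shannon entropy of the empirical distribution over answer options, divided by $\log K$ for $K$ options, so that a uniform choice pattern scores near 1 and a model that always picks the same option scores 0. The function name `normalized_entropy` and the option labels are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def normalized_entropy(choices, options):
    """Entropy of the empirical choice distribution, scaled to [0, 1].

    Assumed definition (not confirmed by the source): H / log(K),
    where K is the number of available answer options.
    """
    k = len(options)
    if k < 2:
        raise ValueError("need at least two answer options")
    n = len(choices)
    if n == 0:
        raise ValueError("need at least one recorded choice")

    counts = Counter(choices)
    entropy = 0.0
    for option in options:
        p = counts[option] / n
        if p > 0:  # options never chosen contribute nothing to the sum
            entropy -= p * math.log(p)
    return entropy / math.log(k)


if __name__ == "__main__":
    options = ["A", "B", "C", "D"]  # hypothetical multiple-choice labels
    biased = ["A"] * 9 + ["B"]      # strong preference for one option
    balanced = ["A", "B", "C", "D"] * 5
    print(f"biased:   {normalized_entropy(biased, options):.3f}")
    print(f"balanced: {normalized_entropy(balanced, options):.3f}")
```

Under this assumed definition, a higher score only says that a model spreads its answers more evenly across options. It does not by itself say whether those answers are correct, so it would be read alongside task accuracy.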