EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery

Zelin Xu; Yupu Zhang; Saugat Adhikari; Saiful Islam; Tingsong Xiao; Zibo Liu; Shigang Chen; Da Yan; Zhe Jiang

TL;DR

EarthSpatialBench is a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. It contains over 325K question-answer pairs spanning qualitative and quantitative reasoning about spatial distance and direction, as well as systematic topological relations.

Abstract

Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose EarthSpatialBench, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons. We conducted extensive experiments on both open-source and proprietary models to identify limitations in the spatial reasoning of MLLMs.
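
To make the three relation families in the abstract concrete, the following is a minimal sketch of how distance, direction, and topology could be derived for a pair of 2D bounding boxes. It is not the benchmark's pipeline. It assumes pixel coordinates with y growing downward, a north-up image, centroid-to-centroid direction, and non-degenerate boxes; the function names and the `metres_per_pixel` parameter are hypothetical, introduced only for illustration.

```python
import math

# Assumption: a box is (xmin, ymin, xmax, ymax) in pixels, y grows downward,
# and the image is north-up. All names here are illustrative, not from the paper.

def box_distance(a, b, metres_per_pixel=1.0):
    """Minimum gap between two axis-aligned boxes (0 if they touch or overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy) * metres_per_pixel

def centroid(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def bearing_deg(a, b):
    """Clockwise angle from north of b's centroid as seen from a's, in [0, 360)."""
    (ax, ay), (bx, by) = centroid(a), centroid(b)
    east = bx - ax
    north = ay - by  # flipped because image y grows downward
    return math.degrees(math.atan2(east, north)) % 360.0

COMPASS = ["north", "northeast", "east", "southeast",
           "south", "southwest", "west", "northwest"]

def compass_label(angle_deg):
    """Map a bearing to one of eight 45-degree sectors centred on each direction."""
    return COMPASS[int((angle_deg + 22.5) // 45) % 8]

def topology(a, b):
    """Coarse topological relation of box a to box b."""
    if a == b:
        return "equals"
    if a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]:
        return "disjoint"
    # Boundaries meet along an edge or corner but interiors do not intersect.
    if a[2] == b[0] or b[2] == a[0] or a[3] == b[1] or b[3] == a[1]:
        return "touches"
    if a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]:
        return "contains"
    if b[0] <= a[0] and b[1] <= a[1] and b[2] >= a[2] and b[3] >= a[3]:
        return "within"
    return "overlaps"
```

For example, `topology((0, 0, 10, 10), (2, 2, 5, 5))` yields `"contains"`, and `compass_label(bearing_deg((0, 0, 10, 10), (20, 0, 30, 10)))` yields `"east"`. Polylines and polygons need general geometry predicates, for which a GIS library would be the natural choice.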

Paper Structure (15 sections, 6 figures, 5 tables)

This paper contains 15 sections, 6 figures, 5 tables.

Introduction
Relevant Works
Spatial Reasoning Benchmarks for Natural Images
Multimodal Large Language Models
EarthSpatialBench
Evaluation Dimensions
Dataset Construction
Dataset Statistics
Experiment
Experimental Setup
Main Results
Grounding as a Prerequisite for Spatial Reasoning
Comparisons on Different Geometric Representations
Comparisons on Different Geometric Types
Conclusion and Future Work

Figures (6)

Figure 1: Overview of EarthSpatialBench. The benchmark spans diverse spatial relations evaluated over multiple geometry types, object representation modalities, and question formats, enabling comprehensive assessment of spatial reasoning. High-resolution Earth imagery and geometric annotations are collected from SatlasPretrain. Spatial relations (distance, direction, topology) are computed using standard GIS procedures and combined with curated question templates to generate diverse reasoning tasks. A multi-stage quality control process ensures geometric accuracy, semantic clarity, and reliable ground-truth annotations.
Figure 2: (a) Distribution of all 23 geometric object types in EarthSpatialBench (log scale). The dataset includes 6 small-object bounding-box classes (building, dam, parking lot, pier, power substation, helipad), 7 polyline classes (airport runway, airport taxiway, raceway, railway, river, road, track), and 10 large-region polygon classes (airport, crop, landfill, park, quarry, solar farm, water park, theme park, stadium, power plant). (b) Distribution of the total number of annotated objects per image on a log–log scale, showing the wide variation in scene complexity across images. (c) Distribution of question counts across spatial relation categories (distance, direction, topology) and question formats (choice-based, quantitative, localization).
Figure 3: Performance of MLLMs on four grounding variants.
Figure 4: Performance differences between visual-overlay-based and coordinate-based object references. (a) Change in MAE for counting. (b) Change in F1 score for localization.
Figure 5: MAE of angle estimation under different object representation settings.
...and 1 more figure
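
Figures 4 and 5 report mean absolute error (MAE) for counting and for angle estimation. The captions do not say how angular error is measured, so the sketch below rests on an assumption: angles are in degrees and the error wraps around 360, so that 350 versus 10 counts as 20 rather than 340. The function names are hypothetical.

```python
def angular_error_deg(pred, truth):
    """Smallest absolute difference between two angles in degrees (assumed wraparound)."""
    diff = abs(pred - truth) % 360.0
    return min(diff, 360.0 - diff)

def mean_absolute_error(preds, truths, error_fn=lambda p, t: abs(p - t)):
    """Average per-item error; pass angular_error_deg as error_fn for angles."""
    if len(preds) != len(truths) or not preds:
        raise ValueError("preds and truths must be non-empty and of equal length")
    return sum(error_fn(p, t) for p, t in zip(preds, truths)) / len(preds)
```

With the default `error_fn` this gives a plain MAE suitable for counts; passing `angular_error_deg` gives an MAE that respects the circular nature of directions.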
