MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps

Sharat Bhat, Harshita Khandelwal, Tushar Kataria, Vivek Gupta

TL;DR

MapVerse introduces a real-world, map-based VQA benchmark with 11,837 human-authored QA pairs over 1,025 maps across 10 categories to evaluate multimodal geospatial reasoning. The dataset combines diverse map types, geographic granularity, and question formats, enabling fine-grained analysis of visual grounding and spatial reasoning in state-of-the-art VLMs. Comprehensive evaluations reveal that current models excel on simple visual tasks but struggle with complex geospatial reasoning, especially at finer geographic scales and for multi-step inferences; ablation studies show robustness to Gaussian noise but sensitivity to pepper noise and resolution changes. By providing rigorous baselines, rich metadata, and extensive supplementary material, MapVerse offers a durable testbed to advance robust, real-world map understanding and multimodal geospatial reasoning.

Abstract

Maps are powerful carriers of structured and contextual knowledge, encompassing geography, demographics, infrastructure, and environmental patterns. Reasoning over such knowledge requires models to integrate spatial relationships, visual cues, real-world context, and domain-specific expertise, capabilities that current large language models (LLMs) and vision-language models (VLMs) still struggle to exhibit consistently. Yet, datasets used to benchmark VLMs on map-based reasoning remain narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content (outputs from LLMs or pipeline-based methods), offering limited depth for evaluating genuine geospatial reasoning. To address this gap, we present MapVerse, a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps, spanning ten diverse map categories and multiple question categories for each. The dataset provides a rich setting for evaluating map reading, interpretation, and multimodal reasoning. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps. Beyond overall performance, we conduct fine-grained categorical analyses to assess model inference across multiple dimensions and investigate the visual factors shaping reasoning outcomes. Our findings reveal that while current VLMs perform competitively on classification-style tasks, both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning.

Paper Structure (28 sections, 1 equation, 10 figures, 13 tables)

Figures (10)

  • Figure 1: A sample from MapVerse. A rainfall distribution map of Vietnam with a sample QA pair, showing the correct (manually annotated) answer and the answer predicted by a VLM.
  • Figure 2: t-SNE plots of MapVerse for different answer formats and map categories. A ResNet-18 encoder produced the latent representations of the images for the different map categories, and a Sentence Transformer produced the latent representations of the questions for the different answer formats. The t-SNE plots were computed with a perplexity of 800 to maximize inter-class separation.
  • Figure S1: t-SNE of images by geographic granularity.
  • Figure S2: Correlation heatmaps showing the proportional relationship between map type (X-axis) and geographic level (Y-axis).
  • Figure S3: Sample QA for a mixed map type (isopleth + network).
  • ...and 5 more figures
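
The projection step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' code: it assumes the embeddings have already been extracted (for example, 512-dimensional pooled ResNet-18 features for images), it uses scikit-learn's `TSNE` as one possible implementation, and the function name `project_tsne` and its parameters are hypothetical. Only the perplexity of 800 comes from the caption; the PCA initialization and the seed are illustrative choices.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_tsne(features, perplexity=800.0, seed=0):
    """Project (n_samples, n_dims) embeddings to 2-D with t-SNE.

    `features` is assumed to be precomputed (e.g. ResNet-18 image features
    or sentence embeddings of questions); extraction is out of scope here.
    """
    features = np.asarray(features, dtype=np.float32)
    if features.ndim != 2:
        raise ValueError("features must be a 2-D array")
    n_samples = features.shape[0]
    # scikit-learn requires perplexity < n_samples, so a perplexity of 800
    # only makes sense with more than 800 embedded items.
    if perplexity >= n_samples:
        raise ValueError(
            f"perplexity ({perplexity}) must be smaller than "
            f"the number of samples ({n_samples})"
        )
    tsne = TSNE(
        n_components=2,
        perplexity=float(perplexity),
        init="pca",          # assumption: PCA initialization for stability
        random_state=seed,   # assumption: fixed seed for repeatable plots
    )
    return tsne.fit_transform(features)
```

With one embedding per map (1,025 rows), calling `project_tsne(features)` with the default perplexity would satisfy the size constraint. The returned array has one 2-D point per row, which can then be colored by map category or answer format in any plotting library.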