FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models

Jiyoon Pyo, Yuankun Jiao, Dongwon Jung, Zekun Li, Leeje Jang, Sofia Kirsanova, Jina Kim, Yijun Lin, Qin Liu, Junyi Xie, Hadi Askari, Nan Xu, Muhao Chen, Yao-Yi Chiang

TL;DR

FRIEDA is a real-document benchmark for multi-map, multi-step cartographic reasoning in large vision-language models (LVLMs). It targets topological, metric, and directional spatial relations, plus map-element interpretation. It comprises 500 free-form questions across 210 documents and 17,030 map images, evaluated in two settings: a direct setting that supplies the relevant maps, and a contextual setting in which the model may first have to retrieve the relevant maps from within the document. Across 11 LVLMs, both proprietary and open-source, accuracy remains well below human level (about 38% for the best model versus about 85% for humans). Dominant error modes include legend misreadings and cross-map misalignment, and explicit reasoning yields only limited gains. The results point to a need for models that incorporate cartographic priors and cross-image reasoning. The authors provide dataset construction details and validation procedures, and plan to release code and prompts to accelerate progress in spatial intelligence for LVLMs.

Abstract

Cartographic reasoning is the skill of interpreting geographic relationships by aligning legends, map scales, compass directions, map texts, and geometries across one or more map images. Although essential as a concrete cognitive capability and for critical tasks such as disaster response and urban planning, it remains largely unevaluated. Building on progress in chart and infographic understanding, recent large vision-language model (LVLM) studies on map visual question answering (VQA) often treat maps as a special case of charts. In contrast, map VQA demands comprehension of layered symbology (e.g., symbols, geometries, and text labels) as well as spatial relations tied to orientation and distance that often span multiple maps and are not captured by chart-style evaluations. To address this gap, we introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA sources real map images from documents and reports in various domains and geographical areas. Following classifications in Geographic Information System (GIS) literature, FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). All questions require multi-step inference, and many require cross-map grounding and reasoning. We evaluate eleven state-of-the-art LVLMs under two settings: (1) the direct setting, where we provide the maps relevant to the question, and (2) the contextual setting, where the model may have to identify the maps relevant to the question before reasoning. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, respectively, far below human performance of 84.87%. These results reveal a persistent gap in multi-step cartographic reasoning, positioning FRIEDA as a rigorous benchmark to drive progress on spatial intelligence in LVLMs.
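
To make the three categories of spatial relations named in the abstract concrete, the following is a minimal toy sketch in Python. It is not FRIEDA's method or evaluation code. It assumes map regions can be approximated by axis-aligned rectangles in a planar coordinate system with north pointing up, and the names `Box`, `topological_relation`, `metric_distance`, `direction`, and `km_per_unit` are hypothetical, introduced only for illustration. Treating a shared corner as a "border" is also an assumption of this sketch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle standing in for a map region (toy assumption)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def center(self):
        return ((self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2)


def topological_relation(a: Box, b: Box) -> str:
    """Classify a relative to b: equal, within, border, intersect, or disjoint."""
    if a == b:
        return "equal"
    if (a.xmin >= b.xmin and a.ymin >= b.ymin
            and a.xmax <= b.xmax and a.ymax <= b.ymax):
        return "within"
    # Overlap extent along each axis; a negative value means a gap.
    dx = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    dy = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    if dx < 0 or dy < 0:
        return "disjoint"
    if dx == 0 or dy == 0:
        return "border"  # boundaries touch, interiors do not overlap
    return "intersect"


def metric_distance(a: Box, b: Box, km_per_unit: float = 1.0) -> float:
    """Center-to-center distance, converted with a scale read from the map."""
    (ax, ay), (bx, by) = a.center, b.center
    return math.hypot(bx - ax, by - ay) * km_per_unit


def direction(a: Box, b: Box) -> str:
    """Eight-point compass direction of b as seen from a (north is +y)."""
    (ax, ay), (bx, by) = a.center, b.center
    # atan2(dx, dy) gives the bearing measured clockwise from north.
    bearing = math.degrees(math.atan2(bx - ax, by - ay)) % 360
    names = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    return names[int((bearing + 22.5) // 45) % 8]


west = Box(0, 0, 2, 2)
east = Box(2, 0, 4, 2)
print(topological_relation(west, east))          # border
print(metric_distance(west, east, km_per_unit=5))  # 10.0
print(direction(west, east))                     # E
```

The benchmark's questions are harder than this sketch suggests: the geometries are not given, so a model must first recover them from legends, labels, scale bars, and compass roses, often across several map images, before any such relation can be evaluated.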

Paper Structure

This paper contains 71 sections, 21 figures, 17 tables.

Figures (21)

  • Figure 1: Example of a FRIEDA question requiring multi-map, multi-step cartographic reasoning. To solve the question, the model must (1) use each legend to locate the two referenced regions, (2) evaluate the border spatial relation between them, and (3) read the map label of the qualifying feature to answer "Kinsinger Farm."
  • Figure 1: Key statistics.
  • Figure 2: Question distribution by spatial relation (inner) and map count (outer). Sizes are proportional to the number of questions in each category.
  • Figure 3: Overall accuracy of different models on the FRIEDA-direct benchmark.
  • Figure 4: Per spatial relation accuracy (%) of human annotators and three proprietary LVLMs (Gemini-2.5-Pro, Claude-Sonnet-4, and GPT-5-Think) on FRIEDA-direct.
  • ...and 16 more figures