CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding

Huy Quang Ung, Guillaume Habault, Yasutaka Nishimura, Hao Niu, Roberto Legaspi, Tomoki Oya, Ryoichi Kojima, Masato Taya, Chihiro Ono, Atsunori Minamikawa, Yan Liu

TL;DR

CartoMapQA introduces a comprehensive benchmark to probe vision-language models on cartographic map understanding through six hierarchical tasks spanning map feature recognition, scale interpretation, and turn-by-turn navigation. The dataset uses OpenStreetMap-derived maps (2251 questions across 853 maps) and supports zero-shot evaluation of 15 LVLMs, revealing persistent gaps in map semantics, OCR robustness, and geospatial reasoning. Key contributions include novel task design, ground-truth generation via graph-based tools, and a detailed cross-model analysis that exposes concrete failure modes and guides architectural improvements. The work has practical implications for navigation, geographic search, and urban planning, and provides open-source resources to foster further research in map-aware multimodal understanding.

Abstract

The rise of Large Visual-Language Models (LVLMs) has unlocked new possibilities for seamlessly integrating visual and textual information. However, their ability to interpret cartographic maps remains largely unexplored. In this paper, we introduce CartoMapQA, a benchmark specifically designed to evaluate LVLMs' understanding of cartographic maps through question-answering tasks. The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer. These tasks span key low-, mid- and high-level map interpretation skills, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. Our evaluation of both open-source and proprietary LVLMs reveals persistent challenges: models frequently struggle with map-specific semantics, exhibit limited geospatial reasoning, and are prone to Optical Character Recognition (OCR)-related errors. By isolating these weaknesses, CartoMapQA offers a valuable tool for guiding future improvements in LVLM architectures. Ultimately, it supports the development of models better equipped for real-world applications that depend on robust and reliable map understanding, such as navigation, geographic search, and urban planning. Our source code and data are openly available to the research community at: https://github.com/ungquanghuy-kddi/CartoMapQA.git

Paper Structure

This paper contains 17 sections, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Examples of questions and requests featured in CartoMapQA. Note: these are short versions of the actual questions and requests used in the dataset, which can be found in the implementation repository.
  • Figure 2: Illustration of the process used to generate answer choices for the multiple-choice questions in the mfsem task.
  • Figure 3: Distribution and examples of the main issues identified in 50% of the incorrect answers from gemini and qwen 2.5 72B in the stmf task with name-listing requests. Note: due to space constraints, requests shown are short versions of the actual ones.
  • Figure 4: Analysis from the o3 model in the srnav task.
  • Figure 5: Distribution of route lengths in the rlest task.
  • ...and 7 more figures