MapTab: Can MLLMs Master Constrained Route Planning?

Ziqiao Shang, Lingyue Ge, Yang Chen, Shi-Yu Tian, Zhenyu Huang, Wenbo Fu, Yu-Feng Li, Lan-Zhe Guo

TL;DR

MapTab is a benchmark for evaluating constrained reasoning in Multimodal LLMs (MLLMs). It pairs map images with structured tabular data in two real-world scenarios, Metromap and Travelmap. The authors define a formal constrained route-planning (RP) task, build a five-step data pipeline, and provide 328 maps with 196,800 RP queries and 3,936 QA queries, on which they evaluate 15 MLLMs. The results show persistent weaknesses in perception, cross-modal integration, and multi-step reasoning: current models struggle with visually dense maps and complex constraint settings, though structured tables can help anchor grounding. The work offers a realistic testbed for MLLM evaluation and points toward dynamic, real-time, and more complex map-based reasoning, with practical implications for geospatial AI tools.

Abstract

Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artificial General Intelligence (AGI). However, existing benchmarks remain insufficient for rigorously assessing their constrained reasoning capabilities. To bridge this gap, we introduce MapTab, a multimodal benchmark specifically designed to evaluate constrained reasoning in MLLMs via route planning tasks. MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data. The benchmark encompasses two scenarios: Metromap, covering metro networks in 160 cities across 52 countries, and Travelmap, depicting 168 representative tourist attractions from 19 countries. In total, MapTab comprises 328 images, 196,800 route planning queries, and 3,936 QA queries, all incorporating 4 key constraints: Time, Price, Comfort, and Reliability. Extensive evaluations across 15 representative MLLMs reveal that current models face substantial challenges in constrained multimodal reasoning. Notably, under conditions of limited visual perception, multimodal collaboration often underperforms compared to unimodal approaches. We believe MapTab provides a challenging and realistic testbed to advance the systematic evaluation of MLLMs.
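
To make the task concrete, the following is a minimal sketch of how a reference answer to a single-attribute route-planning query could be computed from an edge table. It is an illustration only, not the authors' labeling procedure, and the paper's formal task definition is not reproduced here. The station names, the attribute values, the table layout, and the helper names `build_graph` and `best_route` are all hypothetical, and only Time and Price are modeled.

```python
import heapq

# Hypothetical edge table: (station_a, station_b, time_min, price).
# Station names and values are invented for illustration only.
EDGE_TABLE = [
    ("A", "B", 4, 2.0),
    ("B", "C", 3, 2.0),
    ("A", "D", 2, 5.0),
    ("D", "C", 2, 5.0),
]

def build_graph(edge_table):
    """Turn edge rows into an undirected adjacency list."""
    graph = {}
    for a, b, time_min, price in edge_table:
        attrs = {"time": time_min, "price": price}
        graph.setdefault(a, []).append((b, attrs))
        graph.setdefault(b, []).append((a, attrs))
    return graph

def best_route(graph, start, goal, attribute):
    """Dijkstra's algorithm minimizing the sum of one edge attribute."""
    queue = [(0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, attrs in graph.get(node, []):
            if neighbor not in settled:
                heapq.heappush(
                    queue, (cost + attrs[attribute], neighbor, path + [neighbor])
                )
    return None  # goal is unreachable from start

graph = build_graph(EDGE_TABLE)
fastest = best_route(graph, "A", "C", "time")    # route via D
cheapest = best_route(graph, "A", "C", "price")  # route via B
```

In this toy network the fastest and the cheapest routes differ, which is the kind of trade-off a constrained query forces a model to resolve. In the benchmark the model must additionally recover the network itself from the map image rather than receive it as a clean graph.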

Paper Structure (58 sections, 6 equations, 13 figures, 9 tables, 1 algorithm)

Figures (13)

  • Figure 1: Composition and Statistical Overview of the MapTab Benchmark. MapTab features 328 high-resolution maps across Metromap and Travelmap scenarios, providing 196,800 RP queries and 3,936 QA queries.
  • Figure 2: Schematic overview of the MapTab construction pipeline, comprising 5 main steps: Image Collection & Preprocessing, Tabular Construction, Quality Control, Query Generation, and Label Annotation.
  • Figure 3: Impact of image resolution on Route Planning (RP) tasks in (a) Metromap and (b) Travelmap scenarios. Images are downsampled to ratios of 1/2, 1/4, 1/8, 1/16, and 1/32. Performance is evaluated under four map-based input settings: Map-Only (M), Map+Edge_tab (M+E_tab), Map+Edge_tab+Vertex_tab (M+E_tab+V_tab), and Map+Vertex2_tab (M+V2_tab).
  • Figure 4: Distribution of model accuracy across Map Difficulty and Query Difficulty under different input modalities (Metromap and Travelmap scenarios).
  • Figure 5: Model accuracy matrix under different combinations of Map Difficulty and Query Difficulty. The first row shows results in the Metromap scenario, while the second row corresponds to the Travelmap scenario. M denotes Map Only; E denotes Edge Table Only; M+E denotes Map + Edge Table; M+E+V denotes Map + Edge Table + Vertex Table; and M+V2 denotes Map + Vertex2_tab.
  • ...and 8 more figures