Can Large Vision Language Models Read Maps Like a Human?
Shuo Xing, Zezhou Sun, Shuangyu Xie, Kaiyuan Chen, Yanjia Huang, Yuping Wang, Jiachen Li, Dezhen Song, Zhengzhong Tu
TL;DR
The paper tackles map-space pathfinding with large vision-language models by introducing MapBench, a dataset of 1649 pixel-based map queries drawn from 100 maps, and MSSG, a graph-based representation of map structure that encodes landmarks, intersections, and connectivity. It provides conversion tools between MSSG and natural language to enable structured reasoning and evaluation of LVLMs under zero-shot and Chain-of-Thought prompting. Experimental results show that current LVLMs struggle with map-space navigation, with CoT prompting offering improvements but at times adding redundant information and exposing gaps in spatial reasoning and planning. By releasing MapBench and MSSG tooling, the work lays a foundation for more robust map-based multimodal reasoning and long-horizon planning in real-world navigation tasks.
Abstract
In this paper, we introduce MapBench, the first dataset specifically designed for outdoor navigation on human-readable, pixel-based maps, curated from complex path-finding scenarios. MapBench comprises over 1600 pixel-space map path-finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure for converting to and from natural language and for evaluating LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all code and data at https://github.com/taco-group/MapBench.
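
To make the role of a scene graph as an index between map structure and language more concrete, the following is a minimal sketch and not the MapBench implementation. It assumes a toy graph whose nodes are landmarks or intersections and whose edges denote connectivity; the node names, the `shortest_path` and `describe` helpers, and the use of breadth-first search are all illustrative assumptions rather than details taken from the paper or its repository.

```python
from collections import deque

# Hypothetical toy map: node names and edges are invented for illustration.
# Each node is tagged as a "landmark" or an "intersection".
NODES = {
    "Museum": "landmark",
    "I1": "intersection",
    "I2": "intersection",
    "Fountain": "landmark",
    "Cafe": "landmark",
}
EDGES = [
    ("Museum", "I1"),
    ("I1", "I2"),
    ("I1", "Cafe"),
    ("I2", "Fountain"),
]


def build_adjacency(edges):
    """Build an undirected adjacency list from (a, b) edge pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def shortest_path(adj, start, goal):
    """Breadth-first search; returns the fewest-edge path or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def describe(path, nodes):
    """Turn a node path into simple natural-language instructions."""
    steps = [f"Start at {path[0]}."]
    for node in path[1:-1]:
        if nodes[node] == "intersection":
            steps.append(f"Continue through intersection {node}.")
        else:
            steps.append(f"Pass {node}.")
    steps.append(f"Arrive at {path[-1]}.")
    return " ".join(steps)


adj = build_adjacency(EDGES)
route = shortest_path(adj, "Museum", "Fountain")
print(describe(route, NODES))
```

A reference route produced this way could, in principle, be compared against the landmarks and intersections named in an LVLM's generated instructions; how MapBench actually scores outputs is not specified in this abstract.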
