Can Large Vision Language Models Read Maps Like a Human?

Shuo Xing; Zezhou Sun; Shuangyu Xie; Kaiyuan Chen; Yanjia Huang; Yuping Wang; Jiachen Li; Dezhen Song; Zhengzhong Tu

TL;DR

The paper tackles map-space pathfinding with large vision-language models by introducing MapBench, a dataset of 1649 pixel-based map queries drawn from 100 maps, and MSSG, a graph-based representation of map structure that encodes landmarks, intersections, and connectivity. It provides conversion tools between MSSG and natural language to enable structured reasoning and evaluation of LVLMs under zero-shot and Chain-of-Thought prompting. Experimental results show that current LVLMs struggle with map-space navigation, with CoT prompting offering improvements but at times adding redundant information and exposing gaps in spatial reasoning and planning. By releasing MapBench and MSSG tooling, the work lays a foundation for more robust map-based multimodal reasoning and long-horizon planning in real-world navigation tasks.

Abstract

In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based outdoor map navigation, curated from complex path-finding scenarios. MapBench comprises over 1600 pixel-space map path-finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure for converting between the graph and natural language and for evaluating LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all the code and the dataset at https://github.com/taco-group/MapBench.
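
To make the idea of a graph over landmarks and intersections concrete, here is a minimal Python sketch of how such a structure could support path finding and conversion to natural-language instructions. It is an illustration under stated assumptions, not the MapBench implementation: the node names, the edge list, and the helpers `build_adjacency`, `shortest_path`, and `describe` are all hypothetical, and breadth-first search (fewest hops) stands in for whatever routing the released tooling actually uses.

```python
from collections import deque

# Hypothetical toy graph: node names and edges are invented for illustration
# and are not taken from MapBench.
NODES = {
    "Visitor Center": "landmark",
    "I1": "intersection",
    "I2": "intersection",
    "Lake": "landmark",
    "Picnic Area": "landmark",
}
EDGES = [
    ("Visitor Center", "I1"),
    ("I1", "I2"),
    ("I1", "Picnic Area"),
    ("I2", "Lake"),
]


def build_adjacency(edges):
    """Build an undirected adjacency list from (a, b) edge pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def shortest_path(adj, start, goal):
    """Breadth-first search: return the fewest-hop path as a list, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def describe(path, nodes):
    """Turn a node path into a simple natural-language route description."""
    steps = [f"Start at {path[0]}."]
    for node in path[1:-1]:
        steps.append(f"Continue through {nodes[node]} {node}.")
    steps.append(f"Arrive at {path[-1]}.")
    return " ".join(steps)


adj = build_adjacency(EDGES)
path = shortest_path(adj, "Visitor Center", "Lake")
print(describe(path, NODES))
```

A route produced this way could serve as a reference against which a model's generated instructions are compared, though the exact evaluation procedure is defined by the paper and its released code rather than by this sketch.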
