Can Large Vision Language Models Read Maps Like a Human?

Shuo Xing, Zezhou Sun, Shuangyu Xie, Kaiyuan Chen, Yanjia Huang, Yuping Wang, Jiachen Li, Dezhen Song, Zhengzhong Tu

TL;DR

The paper tackles map-space pathfinding with large vision-language models by introducing MapBench, a dataset of 1649 pixel-based map queries drawn from 100 maps, and MSSG, a graph-based representation of map structure that encodes landmarks, intersections, and connectivity. It provides conversion tools between MSSG and natural language to enable structured reasoning and evaluation of LVLMs under zero-shot and Chain-of-Thought prompting. Experimental results show that current LVLMs struggle with map-space navigation, with CoT prompting offering improvements but at times adding redundant information and exposing gaps in spatial reasoning and planning. By releasing MapBench and MSSG tooling, the work lays a foundation for more robust map-based multimodal reasoning and long-horizon planning in real-world navigation tasks.

Abstract

In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based outdoor map navigation, curated from complex path-finding scenarios. MapBench comprises over 1600 pixel-space map path-finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure, used to convert to and from natural language and to evaluate LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all the code and the dataset at https://github.com/taco-group/MapBench.
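
To make the role of the MSSG more concrete, the following is a minimal sketch of a graph over landmarks and intersections, a shortest-path query between two landmarks, and a conversion of the resulting path into language instructions. It is an illustration under stated assumptions, not the authors' implementation: the node names, road names, pixel lengths, and the helper functions `build_graph`, `shortest_path`, and `describe` are all hypothetical, and Dijkstra's algorithm is used here only as a plausible stand-in for a reference pathfinder.

```python
import heapq

# Hypothetical toy scene graph: nodes are landmarks or intersections,
# edges carry a road name and a length in pixels. All values are made up.
EDGES = [
    ("Entrance", "Intersection A", "Main Path", 40.0),
    ("Intersection A", "Intersection B", "Main Path", 60.0),
    ("Intersection A", "Cafe", "Garden Walk", 30.0),
    ("Intersection B", "Aviary", "Bird Lane", 50.0),
    ("Cafe", "Aviary", "Garden Walk", 200.0),
]


def build_graph(edges):
    """Build an undirected adjacency list: node -> [(neighbor, road, length)]."""
    graph = {}
    for u, v, road, length in edges:
        graph.setdefault(u, []).append((v, road, length))
        graph.setdefault(v, []).append((u, road, length))
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns (length, path) or (inf, []) if unreachable."""
    best = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, _road, length in graph.get(node, []):
            cand = dist + length
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(heap, (cand, nxt))
    if goal not in best:
        return float("inf"), []
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return best[goal], path[::-1]


def describe(graph, path):
    """Turn a node sequence into one plain-language instruction per edge."""
    steps = []
    for u, v in zip(path, path[1:]):
        road = next(r for n, r, _ in graph[u] if n == v)
        steps.append(f"Take {road} from {u} to {v}.")
    return steps


if __name__ == "__main__":
    g = build_graph(EDGES)
    length, path = shortest_path(g, "Entrance", "Aviary")
    print(length, path)
    for step in describe(g, path):
        print(step)
```

A structure of this kind lets a reference route be computed from the graph and rendered as text, which is one way an LVLM's generated instructions could be compared against a ground-truth path.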

Paper Structure

This paper contains 30 sections, 9 equations, 4 figures, 6 tables, 3 algorithms.

Figures (4)

  • Figure 1: MapBench is a dataset of over 1600 map-space path-finding problems from 100 diverse map images. MapBench evaluates language-based navigation instructions that Large Vision-Language Models (LVLMs) generate from map images containing cluttered and potentially occluded visual symbols.
  • Figure 2: Map Space Scene Graph for a human-readable map.
  • Figure 3: Sampled MapBench examples from each scenario. Segmenting the map and navigating based on the query require expert-level spatial reasoning and understanding.
  • Figure 4: Illustration of CoT extension to VLMs.

Theorems & Definitions (6)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6