ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps

Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, Xinchao Wang

Abstract

Multimodal large language models (MLLMs) have demonstrated significant progress in semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on more complex tasks involving mathematics and logic. We introduce ReasonMap, a novel benchmark specifically designed to evaluate the fine-grained visual reasoning capabilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Our comprehensive evaluation of 16 popular MLLMs reveals a counterintuitive pattern: among open-source models, base variants outperform their reasoning-tuned counterparts, whereas the opposite trend is observed in closed-source models. Further analysis under the visual-masking setting confirms that strong performance requires direct visual grounding and cannot rest on language priors alone. We further establish a training baseline with reinforcement fine-tuning, providing a reference for future exploration. We hope this benchmark study offers new insights into visual reasoning and helps investigate the gap between open- and closed-source models.

Paper Structure

This paper contains 44 sections, 2 equations, 8 figures, 11 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of ReasonMap. We present a novel benchmark tailored for evaluating the fine-grained visual reasoning capabilities of MLLMs. The dataset comprises $1{,}008$ question-answer pairs derived from high-resolution transit maps across $30$ cities in $13$ countries, featuring diversely structured questions. Further details on the dataset construction pipeline are provided in Section \ref{sec:dataset}.
  • Figure 2: The building pipeline of ReasonMap consists of three main stages: (1) data collection and preprocessing, (2) question–answer pair construction, and (3) quality control. Steps (2-4) in the figure correspond to the question–answer pair construction stage.
  • Figure 3: Error case analyses of various MLLMs using ReasonMap. For reasoning models, the reasoning process is explicitly marked with <think> and </think> tags. We highlight erroneous content in the answers in red and categorize the errors accordingly.
  • Figure A1: Accuracy across difficulty combinations for four representative MLLMs (Qwen2.5-VL-72B-I, InternVL3-78B, OpenAI o3, and Doubao-415). Each difficulty combination is denoted by a pair (e.g., easy-hard), where the first term indicates question difficulty and the second term represents map difficulty. The pair (hard-middle) contains only one sample, leading to an accuracy of 100%. We summarize the number of evaluation samples in each difficulty bucket: $55$ samples for easy-easy, $46$ for easy-middle, $28$ for middle-easy, $7$ for hard-easy, $23$ for middle-middle, $80$ for easy-hard, $1$ for hard-middle, $57$ for middle-hard, and $15$ for hard-hard.
  • Figure A2: Accuracy across different cities for four representative MLLMs (Qwen2.5-VL-72B-I, InternVL3-78B, OpenAI o3, and Doubao-415). Each city is marked with the corresponding map difficulty and the country flag. Each city in the test set provides a specific number of samples per model: $32$ samples for Auckland, $34$ for Los Angeles, $7$ for Miami, $35$ for Lisboa, $18$ for Geneva, $40$ for Beijing, $39$ for Hangzhou, $17$ for Budapest, $39$ for Singapore, $40$ for Rome, and $11$ for Toronto.
  • ...and 3 more figures
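
The sample counts quoted in the captions of Figures A1 and A2 can be cross-checked against each other. The following is a minimal sketch, not code from the paper: the dictionary names and the `marginal` helper are illustrative assumptions, and the only data used are the counts copied from the two captions. It tallies the samples per difficulty bucket and per city and confirms that both breakdowns cover the same number of evaluation samples.

```python
from collections import defaultdict

# Counts copied from the Figure A1 caption, keyed by
# (question difficulty, map difficulty).
bucket_counts = {
    ("easy", "easy"): 55,
    ("easy", "middle"): 46,
    ("middle", "easy"): 28,
    ("hard", "easy"): 7,
    ("middle", "middle"): 23,
    ("easy", "hard"): 80,
    ("hard", "middle"): 1,
    ("middle", "hard"): 57,
    ("hard", "hard"): 15,
}

# Counts copied from the Figure A2 caption, keyed by city.
city_counts = {
    "Auckland": 32, "Los Angeles": 34, "Miami": 7, "Lisboa": 35,
    "Geneva": 18, "Beijing": 40, "Hangzhou": 39, "Budapest": 17,
    "Singapore": 39, "Rome": 40, "Toronto": 11,
}


def marginal(counts, axis):
    """Sum bucket counts over one axis (0 = question, 1 = map difficulty)."""
    totals = defaultdict(int)
    for key, n in counts.items():
        totals[key[axis]] += n
    return dict(totals)


bucket_total = sum(bucket_counts.values())
city_total = sum(city_counts.values())

print("samples by bucket:", bucket_total)
print("samples by city:  ", city_total)
print("by question difficulty:", marginal(bucket_counts, 0))
print("by map difficulty:     ", marginal(bucket_counts, 1))
```

Both totals come to 312, so the two captions describe the same set of evaluation samples. The marginals also show how few hard questions there are (23 in total), which is worth keeping in mind when reading per-bucket accuracies such as the single-sample hard-middle bucket.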