TopViewRS: Vision-Language Models as Top-View Spatial Reasoners

Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, Ivan Vulić

TL;DR

This work probes the spatial reasoning capabilities of vision-language models from a top-view perspective, a setting common in maps and floor plans. It introduces TopViewRS, a dataset of 11,384 multiple-choice questions across four tasks and nine sub-tasks, using both realistic and semantic top-view maps to dissect perception and reasoning. Evaluating 10 VLMs reveals large gaps to human performance, with recognition outperforming localization and relational reasoning; Chain-of-Thought prompting provides a modest but meaningful boost (~5.8%), yet overall top-view spatial reasoning remains limited. The study establishes a controlled benchmark and baseline that motivate further research into improving VLMs' top-view spatial understanding for real-world multimodal tasks.

Abstract

Top-view perspective denotes a typical way in which humans read and reason over different types of maps, and it is vital for localization and navigation of humans as well as of 'non-human' agents, such as the ones backed by large Vision-Language Models (VLMs). Nonetheless, spatial reasoning capabilities of modern VLMs remain unattested and underexplored. In this work, we thus study their capability to understand and reason over spatial relations from the top view. The focus on top view also enables controlled evaluations at different granularities of spatial reasoning; we clearly disentangle different abilities (e.g., recognizing particular objects versus understanding their relative positions). We introduce the TopViewRS (Top-View Reasoning in Space) dataset, consisting of 11,384 multiple-choice questions with either a realistic or a semantic top-view map as visual input. We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with different levels of complexity. Evaluation of 10 representative open- and closed-source VLMs reveals a gap of more than 50% compared to average human performance, and model performance is even lower than the random baseline in some cases. Although additional experiments show that Chain-of-Thought reasoning can boost model capabilities by 5.82% on average, the overall performance of VLMs remains limited. Our findings underscore the critical need for enhanced model capability in top-view spatial reasoning and set a foundation for further research towards human-level proficiency of VLMs in real-world multimodal tasks.

Paper Structure (27 sections, 3 equations, 4 figures, 15 tables)

This paper contains 27 sections, 3 equations, 4 figures, and 15 tables.

Figures (4)

  • Figure 1: Illustration of the four evaluation tasks with an incremental level of complexity on the two types of top-view maps (photo-realistic versus semantic maps), covering top-view perception and spatial reasoning abilities, with 9 sub-tasks in total (red font), focusing on different, well-defined VLM abilities. The radar graphs (top right) compare the representative models' performance on all sub-tasks, indicating a large gap with human performance.
  • Figure 2: TopViewRS data statistics, showing distribution of task sizes, objects, regions, spatial and relative spatial descriptions in realistic and semantic map settings, where the tasks are described with their initials for visualization.
  • Figure 3: Visualization of the fine-grained comparison of 10 models and humans on 9 sub-tasks using realistic and semantic top-view maps, demonstrating that most current models perform on par with the random baseline in spatial reasoning and have a large gap with human performance. Exact numbers are reported in the fine-grained results table in the Appendix.
  • Figure 4: Dataset Statistical Analysis