An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models

Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Gholamreza Haffari, Yuan-Fang Li

TL;DR

A novel VQA dataset, Spatial-MM, is constructed to comprehensively study the spatial understanding and reasoning capabilities of Large Multimodal Models (LMMs); the analysis reveals that LMMs are much stronger at basic object detection than at complex spatial reasoning.

Abstract

Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks. However, their spatial reasoning capabilities are under-investigated. In this paper, we construct a novel VQA dataset, Spatial-MM, to comprehensively study LMMs' spatial understanding and reasoning capabilities. Our analyses of object-relationship and multi-hop reasoning reveal several important findings. Firstly, bounding boxes and scene graphs, even synthetic ones, can significantly enhance LMMs' spatial reasoning. Secondly, LMMs struggle more with questions about the image posed from the human perspective than with those posed from the camera perspective. Thirdly, chain-of-thought (CoT) prompting does not improve model performance on complex multi-hop questions involving spatial relations. Moreover, spatial reasoning steps are much less accurate than non-spatial ones across LMMs. Lastly, our perturbation analysis on GQA-spatial reveals that LMMs are much stronger at basic object detection than at complex spatial reasoning. We believe our benchmark dataset and in-depth analyses can spark further research on LMMs' spatial reasoning. The Spatial-MM benchmark is available at: https://github.com/FatemehShiri/Spatial-MM

Paper Structure

This paper contains 28 sections, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Benchmarking the spatial reasoning capabilities of GPT-4o (date accessed: June 12, 2024). Text in red and green signifies incorrect and ground-truth answers, respectively. The accuracy of GPT-4o in answering questions related to the human's viewpoint in the image is only 27.5%.
  • Figure 2: VQA examples from our Spatial-MM that encompass a range of challenging visual patterns.
  • Figure 3: An example of generated reasoning steps for a multi-hop question.
  • Figure 4: Examples of caption perturbation.
  • Figure 5: Instances where the spatial reasoning capabilities of GPT-4V (date accessed: June 6, 2024) fall short due to inaccurate spatial understanding. Text in red signifies an incorrect response. All the images referenced are from our Spatial-MM benchmark and are sourced from the Internet.
  • ...and 8 more figures