SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models

Yuechen Xie, Xiaoyan Zhang, Yicheng Shan, Hao Zhu, Rui Tang, Rong Wei, Mingli Song, Yuanyu Wan, Jie Song

TL;DR

The paper proposes recursive scene graph assisted reasoning, a method that leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, enhancing the spatial logical reasoning ability of VLMs and outperforming all previous methods.

Abstract

Vision-Language Models (VLMs) are increasingly applied in real-world scenarios because of their strong understanding and reasoning capabilities. Although VLMs have demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which requires understanding not only the spatial relationships among objects in complex scenes but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question-answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs and outperforming all previous methods. Code and dataset are available at https://github.com/xieyc99/SpatiaLQA.

Paper Structure (35 sections, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Common VQA [hudson2019gqa] typically involves recognizing visual content and factual knowledge, while common logical reasoning [yue2024mmmu] focuses on abstract, symbolic problem-solving. Spatial logical reasoning, in contrast, requires integrating both spatial understanding and multi-step logical reasoning to accomplish tasks in real-world scenes.
  • Figure 2: Prompt template and examples of several indoor scenes.
  • Figure 3: The distributions of answer step counts, scene categories, and partial object categories in SpatiaLQA. The x-axes of the three plots represent the number of answer steps, indoor scene categories, and object categories, while the y-axes indicate the number of samples.
  • Figure 4: The data collection pipeline for SpatiaLQA. Note that although the figure shows graph expansion augmentation applied only to the data from subgraph extraction augmentation, we also apply it to the manually annotated data.
  • Figure 5: The matching process between the predicted and annotated steps. We first use GPT-4o to match the predicted steps and annotated steps in pairs based on the image (allowing one-to-many matches), resulting in a matching matrix. Then, we apply the Hungarian algorithm to filter the matching matrix, removing redundant matches to obtain the maximum number of one-to-one matches.
  • ...and 3 more figures
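
The filtering step in the Figure 5 caption can be illustrated with a small sketch. It is not the authors' implementation. It assumes the matching matrix is binary (1 where GPT-4o judged a predicted step and an annotated step to match), and the function name and example matrix are hypothetical. For a 0/1 matrix, maximizing the number of one-to-one matches is a maximum bipartite matching problem, so the sketch uses a simple augmenting-path search (Kuhn's algorithm) rather than a full Hungarian algorithm implementation. Both yield the same number of matches in this binary setting.

```python
from typing import List, Optional


def max_one_to_one_matches(match: List[List[int]]) -> List[Optional[int]]:
    """For each predicted step (row), return its paired annotated step (column) or None.

    `match[i][j]` is 1 if predicted step i was judged to match annotated step j.
    The function name and input layout are assumptions made for this sketch.
    """
    n_pred = len(match)
    n_anno = len(match[0]) if match else 0
    # anno_owner[j] is the predicted step currently paired with annotated step j.
    anno_owner: List[Optional[int]] = [None] * n_anno

    def try_assign(pred: int, seen: List[bool]) -> bool:
        # Try to give `pred` a column, re-routing earlier pairs if needed.
        for anno in range(n_anno):
            if match[pred][anno] and not seen[anno]:
                seen[anno] = True
                owner = anno_owner[anno]
                if owner is None or try_assign(owner, seen):
                    anno_owner[anno] = pred
                    return True
        return False

    for pred in range(n_pred):
        try_assign(pred, [False] * n_anno)

    # Invert the column -> row map into a row -> column pairing.
    pairing: List[Optional[int]] = [None] * n_pred
    for anno, pred in enumerate(anno_owner):
        if pred is not None:
            pairing[pred] = anno
    return pairing


# Hypothetical matching matrix: rows are predicted steps, columns are annotated steps.
# Predicted step 0 matches two annotated steps (a one-to-many match).
candidate = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
pairing = max_one_to_one_matches(candidate)
print(pairing)  # [1, 0, 2]
```

In the example, predicted step 0 is moved from annotated step 0 to annotated step 1 so that predicted step 1, whose only candidate is annotated step 0, can also be matched. The redundant entry `candidate[0][0]` is the one dropped.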