When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

Muku Akasaka; Soyeon Caren Han

TL;DR

A hypothesis-driven analysis of information injection for VSR is conducted across three representative VLMs and two public benchmarks, revealing a consistent pattern: more information does not necessarily yield better reasoning.

Abstract

Visual spatial reasoning (VSR) remains challenging for modern vision-language models (VLMs), despite advances in multimodal architectures. A common strategy is to inject additional information at inference time, such as explicit spatial cues, external commonsense knowledge, or chain-of-thought (CoT) reasoning instructions. However, it remains unclear when such information genuinely improves reasoning and when it introduces noise. In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks. We examine (i) the type and number of spatial contexts, (ii) the amount and relevance of injected commonsense knowledge, and (iii) the interaction between spatial grounding and CoT prompting. Our results reveal a consistent pattern: more information does not necessarily yield better reasoning. Targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge degrades performance, and CoT prompting improves accuracy only when spatial grounding is sufficiently precise. These findings highlight the importance of selective, task-aligned information injection and provide practical guidance for designing reliable multimodal reasoning pipelines.
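The three kinds of injected information named in the abstract (spatial contexts, commonsense knowledge, and CoT instructions) can be pictured as optional blocks added to a zero-shot prompt. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the `Injection` class, the `build_prompt` function, the `max_spatial` parameter, and all prompt wording are hypothetical names and strings chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Injection:
    """Optional information added to a zero-shot prompt (hypothetical structure)."""
    spatial_contexts: List[str] = field(default_factory=list)
    commonsense_facts: List[str] = field(default_factory=list)
    use_cot: bool = False


def build_prompt(question: str, inj: Injection, max_spatial: int = 1) -> str:
    """Assemble a text prompt, keeping at most `max_spatial` spatial cues.

    The default of one cue is an assumption that mirrors the abstract's
    finding that a single targeted cue beats multi-context aggregation.
    """
    parts = []
    cues = inj.spatial_contexts[:max_spatial]
    if cues:
        parts.append("Spatial context: " + " ".join(cues))
    if inj.commonsense_facts:
        parts.append("Commonsense knowledge: " + " ".join(inj.commonsense_facts))
    parts.append("Question: " + question)
    if inj.use_cot:
        # Illustrative CoT wording only; the paper's instruction text is not given here.
        parts.append("Think step by step about the spatial relation before answering.")
    return "\n".join(parts)
```

With an empty `Injection`, `build_prompt` returns only the question line, which corresponds to the zero-shot condition; each added block is one controlled intervention on the input.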

Paper Structure (11 sections, 6 figures)

This paper contains 11 sections, 6 figures.

Introduction
Hypotheses
Intervention Setup: Injected Information
Evaluation Setup
Results
Overall Performance
Analysis of Spatial Context
Effect of Number of Spatial Contexts
Commonsense Knowledge Analysis
Impact of Chain-of-Thought Reasoning
Conclusion

Figures (6)

Figure 1: Overview of the controlled input intervention setup. We treat VLMs as black-box systems and vary only the injected information: (1) Spatial Contexts (SC), (2) Commonsense Knowledge (CK) retrieved from a knowledge base, and (3) Spatial Reasoning instructions (SR).
Figure 2: Overall performance comparison between models on VSR and EmbSpatial. The accuracy difference is calculated from the zero-shot performance. OA, SC, CK, and SR stand for Orient Anything, Spatial Contexts, Commonsense Knowledge, and Spatial Reasoning Instructions, respectively.
Figure 3: Spatial context overall accuracy (%). Prompt types are abbreviated as follows (BB: bounding box, LC: lateral context, VC: vertical context, OV: orientation angle, OC: orientation context, DV: metric depth value, DC: depth context, OvC: overlap context, and RC: relative size context).
Figure 4: Performance trend of each model over different numbers of spatial contexts.
Figure 5: Performance trend of each model over different thresholds of commonsense knowledge similarity on VSR.
...and 1 more figure
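The captions for Figures 2 and 5 refer to two simple quantities: an accuracy difference measured against the zero-shot baseline, and a similarity threshold applied to commonsense knowledge. A rough sketch of both follows. It assumes cosine similarity over embedding vectors, which the captions do not specify, and the function names `cosine`, `filter_facts`, and `accuracy_delta` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_facts(
    query_vec: Sequence[float],
    facts: List[Tuple[str, Sequence[float]]],
    threshold: float,
) -> List[str]:
    """Keep only facts whose similarity to the query meets the threshold."""
    return [text for text, vec in facts if cosine(query_vec, vec) >= threshold]


def accuracy_delta(acc_with_injection: float, acc_zero_shot: float) -> float:
    """Accuracy difference in percentage points, relative to zero-shot."""
    return acc_with_injection - acc_zero_shot
```

Raising `threshold` admits fewer but more relevant facts, which is the knob that a threshold sweep like the one in Figure 5 would vary; a negative `accuracy_delta` indicates that the injected information hurt performance relative to zero-shot.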