Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective

Qiyao Xue, Weichen Liu, Shiqi Wang, Haoming Wang, Yuyang Wu, Wei Gao

TL;DR

The paper addresses the challenge of cross-view spatial reasoning in vision-language models by introducing ReMindView-Bench, a cognitively grounded benchmark with over 50k VQA pairs designed to isolate multi-view spatial reasoning from single-view perception and temporal factors. It combines cognitively informed benchmark design (object-centric vs view-centric representations, schema-based memory, perspective taking) with a dual-analysis framework that includes explicit reasoning path evaluation using LLMs as judges and self-consistency prompts, plus implicit latent-representation probing via linear probing and entropy dynamics. Empirical results reveal that current VLMs excel at in-frame perception but struggle with cross-view integration, cross-frame reasoning, and calibration, with performance dropping as scene clutter and viewpoint transformations increase; larger models show improved phase stability but still exhibit systematic weaknesses in cross-view coherence. The work provides a comprehensive, publicly available benchmark and analysis toolkit to diagnose and guide improvements in spatial reasoning for multi-view vision-language systems.

Abstract

Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We attribute this gap to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors. To address this, we present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints. ReMindView-Bench systematically varies viewpoint spatial pattern and query type to probe key factors of spatial cognition. Evaluations of 15 current VLMs reveal consistent failures in cross-view alignment and perspective-taking in multi-view spatial reasoning, motivating deeper analysis of the reasoning process. Explicit phase-wise analysis using LLM-as-a-judge and self-consistency prompting shows that VLMs perform well on in-frame perception but degrade sharply when integrating information across views. Implicit analysis, including linear probing and entropy dynamics, further shows progressive loss of task-relevant information and uncertainty separation between correct and incorrect trajectories. These results provide a cognitively grounded diagnosis of VLM spatial reasoning and reveal how multi-view spatial mental models are formed, degraded and destabilized across reasoning phases. The ReMindView-Bench benchmark is available at https://huggingface.co/datasets/Xue0823/ReMindView-Bench, and the source code for benchmark construction and VLM reasoning analysis is available at https://github.com/pittisl/ReMindView-Bench.
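
To make "entropy dynamics" and "uncertainty separation" concrete, the following is a minimal sketch and not the authors' implementation. It assumes that uncertainty is measured as the Shannon entropy of the next-token probability distribution at each generation step, and that separation is the gap in mean entropy between incorrect and correct trajectories. The function names and the toy distributions are hypothetical and chosen only for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_trajectory(step_distributions):
    """Entropy at each generation step of a single reasoning trajectory."""
    return [token_entropy(p) for p in step_distributions]


def mean_entropy(trajectories):
    """Average per-step entropy pooled over a group of trajectories."""
    values = [h for traj in trajectories for h in entropy_trajectory(traj)]
    return sum(values) / len(values)


# Toy data (assumption): each trajectory is a list of per-step distributions
# over a 4-token vocabulary. Real distributions would come from model logits.
correct_trajectories = [
    [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]],
]
incorrect_trajectories = [
    [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]],
]

# A positive gap means incorrect trajectories are, on average, more uncertain.
separation = mean_entropy(incorrect_trajectories) - mean_entropy(correct_trajectories)
print(f"mean-entropy gap (incorrect - correct): {separation:.3f} nats")
```

In practice the per-step distributions would be obtained by applying a softmax to the model's output logits, and the entropy curve could be aggregated separately for each reasoning phase.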

Paper Structure

This paper contains 31 sections, 37 figures, 9 tables.

Figures (37)

  • Figure 1: Left: Current VLMs struggle to maintain coherent spatial reasoning across multiple views (✓ indicates consistent reasoning and × denotes incorrect reasoning or localization). Middle: We assess multi-view spatial reasoning through fine-grained dimensions of diverse viewpoint spatial patterns and query types to capture key cognitive factors in spatial reasoning. Right: To further interpret VLM's successes and failures, we conduct explicit analysis of VLM's textual reasoning path and implicit analysis of VLM's latent token representation.
  • Figure 2: Stages of humans' spatial mental modeling
  • Figure 3: ReMindView-Bench construction pipeline. It first generates diverse indoor scenes with various room types and object densities by adjusting scene constraint parameters in Infinigen. Next, multiple views are rendered in Blender with controlled camera–object distances and spatial patterns. VQA data is produced using predefined query templates, combined with metadata extracted from the scene and views.
  • Figure 4: Query-related label combinations to generate VQA pairs
  • Figure 5: Task accuracy with different numbers of objects
  • ...and 32 more figures