Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective

Qiyao Xue; Weichen Liu; Shiqi Wang; Haoming Wang; Yuyang Wu; Wei Gao

TL;DR

The paper addresses the challenge of cross-view spatial reasoning in vision-language models by introducing ReMindView-Bench, a cognitively grounded benchmark with over 50k VQA pairs designed to isolate multi-view spatial reasoning from single-view perception and temporal factors. It combines cognitively informed benchmark design (object-centric vs view-centric representations, schema-based memory, perspective taking) with a dual-analysis framework that includes explicit reasoning path evaluation using LLMs as judges and self-consistency prompts, plus implicit latent-representation probing via linear probing and entropy dynamics. Empirical results reveal that current VLMs excel at in-frame perception but struggle with cross-view integration, cross-frame reasoning, and calibration, with performance dropping as scene clutter and viewpoint transformations increase; larger models show improved phase stability but still exhibit systematic weaknesses in cross-view coherence. The work provides a comprehensive, publicly available benchmark and analysis toolkit to diagnose and guide improvements in spatial reasoning for multi-view vision-language systems.

Abstract

Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We attribute this gap to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors. To address this, we present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints. ReMindView-Bench systematically varies viewpoint spatial pattern and query type to probe key factors of spatial cognition. Evaluations of 15 current VLMs reveal consistent failures in cross-view alignment and perspective-taking in multi-view spatial reasoning, motivating deeper analysis of the reasoning process. Explicit phase-wise analysis using LLM-as-a-judge and self-consistency prompting shows that VLMs perform well on in-frame perception but degrade sharply when integrating information across views. Implicit analysis, including linear probing and entropy dynamics, further shows progressive loss of task-relevant information and uncertainty separation between correct and incorrect trajectories. These results provide a cognitively grounded diagnosis of VLM spatial reasoning and reveal how multi-view spatial mental models are formed, degraded and destabilized across reasoning phases. The ReMindView-Bench benchmark is available at https://huggingface.co/datasets/Xue0823/ReMindView-Bench, and the source code for benchmark construction and VLM reasoning analysis is available at https://github.com/pittisl/ReMindView-Bench.
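
As a rough illustration of what "entropy dynamics" can mean in practice, the sketch below computes the Shannon entropy of a next-token distribution at each generation step and compares two toy trajectories. It is a minimal sketch under stated assumptions, not the authors' analysis code: the function names (`softmax`, `token_entropy`, `entropy_trajectory`) and the toy logits are hypothetical, and the paper may define or aggregate entropy differently.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the distribution implied by the logits."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_trajectory(step_logits):
    """Entropy at each generation step, given one logit vector per step."""
    return [token_entropy(logits) for logits in step_logits]


# Hypothetical toy data: one peaked (confident) and one flat (uncertain)
# sequence of per-step logits over a 4-token vocabulary.
confident_steps = [[8.0, 0.0, 0.0, 0.0], [9.0, 0.0, 0.0, 0.0]]
uncertain_steps = [[1.0, 1.0, 1.0, 1.0], [0.5, 0.4, 0.6, 0.5]]

confident_entropy = entropy_trajectory(confident_steps)
uncertain_entropy = entropy_trajectory(uncertain_steps)
```

Under this toy setup, a trajectory with flatter per-step distributions yields higher entropy at every step, which is the kind of separation between trajectories that an entropy-based analysis would look for.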
