3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li, Xiaodong Cai, Zijian Lin, Wen Huang, Hai-Tao Zheng

TL;DR

This work introduces 3ViewSense, a framework that grounds spatial reasoning in orthographic views, and proposes a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities.

Abstract

Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce 3ViewSense, a framework that grounds spatial reasoning in orthographic views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems.
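
To make the idea of canonical orthographic projections concrete, the following is a minimal sketch, not the paper's implementation. It assumes a block scene is stored as a height map (`heights[y][x]` is the number of cubes stacked on ground cell `(x, y)`) and that the front and left views can be summarized by their silhouette heights. The names `three_views` and `count_blocks` are hypothetical. The example also shows why views alone can leave block counting ambiguous: two different scenes can share the same top, front, and left views.

```python
from typing import List, Tuple

Grid = List[List[int]]


def three_views(heights: Grid) -> Tuple[Grid, List[int], List[int]]:
    """Return (top, front, left) views of a stacked-block scene.

    Assumed conventions (illustrative only):
      top   : 1 where a cell holds at least one block, else 0
      front : for each x column, the tallest stack along y
      left  : for each y row, the tallest stack along x
    """
    top = [[1 if h > 0 else 0 for h in row] for row in heights]
    front = [max(column) for column in zip(*heights)]
    left = [max(row) for row in heights]
    return top, front, left


def count_blocks(heights: Grid) -> int:
    """Total number of unit blocks in the scene."""
    return sum(sum(row) for row in heights)


# Two different scenes on a 2x2 ground grid.
scene_a = [[2, 1],
           [1, 2]]
scene_b = [[2, 2],
           [2, 2]]

# Both scenes project to identical top, front, and left views ...
same_views = three_views(scene_a) == three_views(scene_b)

# ... yet they contain different numbers of blocks.
blocks_a = count_blocks(scene_a)
blocks_b = count_blocks(scene_b)
```

Under these assumptions, `scene_a` and `scene_b` are indistinguishable from their three views even though their block counts differ, which is the kind of ambiguity a uniqueness condition on the views has to rule out.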

Paper Structure (29 sections, 2 theorems, 10 equations, 13 figures, 10 tables)

This paper contains 29 sections, 2 theorems, 10 equations, 13 figures, 10 tables.

Key Result

Theorem 1.2

A configuration $H$ is uniquely determined by $\mathcal{V}(H)$ if and only if for every $(x,y)$ with $H_{x,y}>0$,

Figures (13)

  • Figure 1: Motivation for explicit three-view reasoning. Providing explicit orthographic three-view descriptions (front/left/top) improves block-counting performance under occlusion, highlighting the role of view-consistent spatial representations.
  • Figure 2: The construction pipeline of our OrthoMind-3D dataset. To bridge the gap between visual perception and mental spatial reasoning, we curate data from two distinct domains. For In-Domain data, we utilize programmatic synthesis with strict geometric constraints to train the model's orthographic projection capabilities. For Out-of-Domain data, we employ sandbox game engines and generative AI techniques to evaluate the model's robustness and generalization in unstructured environments.
  • Figure 3: The training framework of 3ViewSense. Stage I learns to induce canonical front, left, and top orthographic views from an egocentric input. Stage II performs view-grounded reasoning by integrating the inferred views to generate reasoning traces and final answers, with reinforcement learning for refinement.
  • Figure 4: In-context learning (ICL) and explicit orthographic three-view description study on OrthoMind-3D (in-domain). ICL yields limited improvements only for the strongest proprietary models, while explicit three-view descriptions substantially improve performance for most models, supporting the need for a view-consistent intermediate representation.
  • Figure 5: RL ablation on initialization. We compare GRPO reward trajectories when starting RL from the Stage I OMS-SFT model versus from the Stage II VGR-SFT model.
  • ...and 8 more figures

Theorems & Definitions (4)

  • Definition 1.1: Notation
  • Theorem 1.2: Uniqueness
  • Proof
  • Corollary 1.3