ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

Xingjian Tao; Yiwei Wang; Yujun Cai; Yifan Song; Jing Tang

TL;DR

This work presents ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. The model is trained with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior.

Abstract

Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.

Paper Structure (30 sections, 7 equations, 4 figures, 3 tables)

Introduction
Related Work
Reinforcement Learning for MultiModal Large Language Models Reasoning
Spatial reasoning with MLLMs
ViewFusion
Limitations of Reasoning Models under Multi-View Inputs
Training Data Preparation
SFT data (18K).
RL data (16K).
Training Strategy
Preliminary: SFT and GRPO
Supervised Fine-Tuning (SFT).
Group Relative Policy Optimization (GRPO).
Two-Stage Optimization
Reward Design for RL
...and 15 more sections

Figures (4)

Figure 1: An example of multi-view spatial reasoning. Given two images captured from different viewpoints, the model must align shared visual cues across views to infer the viewpoint relationship and answer a direction-based question (e.g., locating the picture frame relative to the piano).
Figure 2: Overview of ViewFusion for multi-view spatial reasoning. Given a multi-view question (left), existing “describe-first” or direct “think-and-answer” paradigms often produce view-local descriptions and then shortcut to answering without establishing correct cross-view spatial relations, leading to errors (top). ViewFusion instead performs explicit multi-view spatial pre-thinking to link perspectives and infer viewpoint transformations across images before question solving (bottom), yielding more reliable reasoning and correct predictions.
Figure 3: Qualitative examples on MMSI-Bench. The red boxes highlight the same visual elements observed from different viewpoints across the two images. Compared with Qwen3-VL-4B-Instruct, ViewFusion better aligns cross-view correspondences and infers the underlying viewpoint change, leading to correct answers.
Figure 4: Training curves during GRPO over 1500 steps, including the total reward (left), the accuracy reward (second), the format reward (third), and the KL divergence to the reference policy (right).
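
The reward components named in the Figure 4 caption (an accuracy reward and a format reward, combined into a total reward) can be illustrated with a minimal sketch of how GRPO turns per-response rewards into group-relative advantages. This is not the authors' implementation: the function names, the additive combination, and the `format_weight` value are assumptions made for illustration, and the KL term to the reference policy is omitted. Only the mean/standard-deviation normalization within a sampled group follows the standard GRPO formulation.

```python
import statistics


def total_reward(is_correct, is_well_formatted, format_weight=0.5):
    """Combine an accuracy reward and a format reward.

    Assumption: a simple weighted sum. The actual weighting used for
    ViewFusion is not given in this summary.
    """
    return float(is_correct) + format_weight * float(is_well_formatted)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Standard GRPO idea: advantage_i = (r_i - mean(r)) / (std(r) + eps),
    computed over responses sampled for the same prompt. The population
    standard deviation is used here; implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled responses to one multi-view question:
# (answer correct?, two-stage output format respected?)
group = [(True, True), (False, True), (False, False), (True, False)]
rewards = [total_reward(correct, formatted) for correct, formatted in group]
advantages = group_relative_advantages(rewards)
```

Responses that are both correct and well formatted receive the largest positive advantage, while a group in which every response earns the same reward yields zero advantages and therefore no learning signal from that prompt.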
