ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

Xingjian Tao, Yiwei Wang, Yujun Cai, Yifan Song, Jing Tang

TL;DR

ViewFusion is a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. It is trained with synthetic reasoning supervision followed by reinforcement learning with GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior.

Abstract

Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.

Paper Structure (30 sections, 7 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: An example of multi-view spatial reasoning. Given two images captured from different viewpoints, the model must align shared visual cues across views to infer the viewpoint relationship and answer a direction-based question (e.g., locating the picture frame relative to the piano).
  • Figure 2: Overview of ViewFusion for multi-view spatial reasoning. Given a multi-view question (left), existing “describe-first” or direct “think-and-answer” paradigms often produce view-local descriptions and then shortcut to answering without establishing correct cross-view spatial relations, leading to errors (top). ViewFusion instead performs explicit multi-view spatial pre-thinking to link perspectives and infer viewpoint transformations across images before question solving (bottom), yielding more reliable reasoning and correct predictions.
  • Figure 3: Qualitative examples on MMSI-Bench. The red boxes highlight the same visual elements observed from different viewpoints across the two images. Compared with Qwen3-VL-4B-Instruct, ViewFusion better aligns cross-view correspondences and infers the underlying viewpoint change, leading to correct answers.
  • Figure 4: Training curves during GRPO over 1500 steps, including the total reward (left), the accuracy reward (second), the format reward (third), and the KL divergence to the reference policy (right).
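
The reward components plotted in Figure 4 can be illustrated with a minimal sketch. It is not the authors' implementation. The tag names (`<spatial_thinking>`, `<reasoning>`, `<answer>`), the `format_weight` value, and all function names are assumptions made for illustration only; this page does not specify the output format or the reward weighting. The one part taken from the standard GRPO formulation is the group-relative advantage, which normalizes each sampled response's reward by the mean and standard deviation of its group.

```python
import re
from statistics import mean, pstdev

# Hypothetical two-stage output format: spatial pre-thinking, then
# question-driven reasoning, then the final answer. Tag names are assumed.
TWO_STAGE_PATTERN = re.compile(
    r"^\s*<spatial_thinking>.+?</spatial_thinking>\s*"
    r"<reasoning>.+?</reasoning>\s*"
    r"<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed two-stage layout."""
    return 1.0 if TWO_STAGE_PATTERN.match(response) else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """Return 1.0 if the extracted answer matches the gold option label."""
    found = re.search(r"<answer>(.+?)</answer>", response, re.DOTALL)
    if found is None:
        return 0.0
    return 1.0 if found.group(1).strip().upper() == gold.strip().upper() else 0.0


def total_reward(response: str, gold: str, format_weight: float = 0.5) -> float:
    """Combine both rewards; the 0.5 weight is an illustrative assumption."""
    return accuracy_reward(response, gold) + format_weight * format_reward(response)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        "<spatial_thinking>View 2 is rotated relative to view 1.</spatial_thinking>"
        "<reasoning>The frame is left of the piano.</reasoning><answer>B</answer>",
        "<answer>A</answer>",  # skips both stages and answers incorrectly
    ]
    rewards = [total_reward(r, gold="B") for r in group]
    print(rewards, group_relative_advantages(rewards))
```

Under these assumptions, a response that skips the pre-thinking stage earns no format reward even when its answer is correct, which is one way a format term could encourage the two-stage behavior the abstract describes.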