Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling

Saurav Jha, M. Jehanzeb Mirza, Wei Lin, Shiqi Yang, Sarath Chandar

TL;DR

The paper investigates how test-time verification of world-model rollouts can augment spatial reasoning in Vision-Language Models. It critiques MindJourney's heuristic verifier and introduces Verification through Spatial Assertions (ViSA), a frame-anchored, claim-based verifier that uses an evidence-quality reward. On SAT-Real, ViSA consistently improves accuracy and explores actions in a more balanced way than the baselines. On MMSI-Bench, however, all verifiers plateau, which points to an information bottleneck in current world models and underscores the need for higher-fidelity world models and task-specific verification strategies for robust multi-view reasoning.

Abstract

Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling where a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney's verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.
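
The uncertainty-based analysis mentioned above rests on answer entropy. As a minimal sketch, assuming that entropy is measured as the Shannon entropy of the empirical distribution over repeatedly sampled answers to a question (the paper's exact estimator is not stated here, and the function names are hypothetical), the quantity could be computed as follows:

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


def answer_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the empirical answer distribution.

    Assumption: `answers` holds answer labels sampled from a VLM for one
    question, e.g. ["A", "B", "A", "A"]. This is an illustrative estimator,
    not necessarily the one used in the paper.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    total = len(answers)
    # Sum p * (-log2 p) over every distinct answer that was observed.
    return sum((c / total) * -math.log2(c / total) for c in counts.values())


def mean_answer_entropy(per_question: Iterable[Sequence[str]]) -> float:
    """Average the per-question entropy over a set of questions."""
    entropies: List[float] = [answer_entropy(a) for a in per_question]
    if not entropies:
        raise ValueError("need at least one question")
    return sum(entropies) / len(entropies)
```

Lower entropy means the sampled answers concentrate on fewer options. Comparing this value with and without the imagined views (or with randomly scored views) is one way to check whether a verifier reduces uncertainty more than chance.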

Paper Structure

This paper contains 16 sections, 2 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Average entropy of InternVL3-14B over the answers of 50 randomly sampled questions from SAT-Real.
  • Figure 2: Illustration of the pipelines for accumulating the evidence buffers (marked by 3x3 boxes) in MindJourney (MJ) vs. our ViSA. While MJ directly prompts a VLM to score all world model views jointly, ViSA scores at a more granular frame level: a claim generator VLM is first asked to generate frame-anchored micro-claims about the observable changes due to an action, and a claim verifier VLM is then asked to evaluate these claims using the same anchored frames as evidence. Based on the evaluation results, a test-time reward ($\mathcal{R}^*$) then scores the individual frames. Arrows denote egocentric actions including moving forward ($\uparrow$), turning left ($\longleftarrow$), and turning right ($\longrightarrow$). Colormap: blue, red, and green denote increasing magnitude for each action.
  • Figure 3: Action distribution comparison between MindJourney and ViSA (ours) across different top-$k \in \{1,2,3\}$ values. We consider three action types (move forward, turn left, turn right) for each model's top-$k$ configuration.
  • Figure 4: Illustration of claim generation and verification steps within ViSA.
  • Figure 5: Effect of the verifiers' permissiveness (top-$k$) on answer selection confidence of 50 randomly selected SAT-Real questions grouped by (a) overall, (b) correct answers, and (c) wrong answers. The baseline is InternVL3-14B.
  • ...and 2 more figures
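
The captions for Figures 2, 3, and 5 describe frame-level scoring followed by top-$k$ selection. The sketch below illustrates only that selection step. It assumes a stand-in reward, the fraction of a frame's micro-claims that the claim verifier accepted. This is not the paper's $\mathcal{R}^*$, whose definition is not given here, and all names (`FrameEvidence`, `frame_score`, `select_top_k`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameEvidence:
    """One imagined frame plus the verifier's verdict on each micro-claim."""
    action: str            # e.g. "forward", "left", "right"
    verdicts: List[bool]   # True where the claim verifier accepted a claim


def frame_score(frame: FrameEvidence) -> float:
    """Stand-in reward: fraction of accepted claims (NOT the paper's R*)."""
    if not frame.verdicts:
        return 0.0  # a frame without claims contributes no evidence
    return sum(frame.verdicts) / len(frame.verdicts)


def select_top_k(frames: List[FrameEvidence], k: int) -> List[FrameEvidence]:
    """Keep the k highest-scoring frames; ties keep their original order."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # sorted() is stable, so equally scored frames stay in input order.
    ranked = sorted(frames, key=frame_score, reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    frames = [
        FrameEvidence("forward", [True, True]),
        FrameEvidence("left", [True, False]),
        FrameEvidence("right", []),
    ]
    for f in select_top_k(frames, k=2):
        print(f.action, frame_score(f))
```

A larger $k$ makes the verifier more permissive, since more frames enter the evidence buffer. Counting the `action` field of the selected frames gives the kind of action distribution that Figure 3 compares.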