MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan

TL;DR

MindJourney presents a plug-and-play approach that endows vision–language models with a controllable world model for imaginative 3D reasoning at test time. By coupling a pose-conditioned video diffusion model with Spatial Beam Search, the framework generates informative egocentric views along imagined trajectories and uses them as evidence to answer spatial questions without any training. Across SAT-Synthesized and SAT-Real benchmarks, MindJourney yields consistent, significant gains across multiple VLM backends and two world-model generators, demonstrating strong model-agnostic benefits and the complementary potential of world-model-based imagination to RL-based self-improvement. The work highlights the practical impact of test-time scaling for embodied AI, enabling robust 3D reasoning in realistic settings while outlining avenues for extending to multi-source inputs and query-conditioned world models.

Abstract

Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over the multi-view evidence gathered during this interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 7.7% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Our method also improves on test-time-inference VLMs trained through reinforcement learning, which demonstrates the potential of using world models for test-time scaling.

Paper Structure

This paper contains 48 sections, 8 equations, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: MindJourney. Given a spatial reasoning query, our method uses a world model to search the imagined 3D space and improves the VLM's spatial understanding at test time through the generated observations.
  • Figure 2: MindJourney Pipeline. Our pipeline starts with Spatial Beam Search for $n$ steps before answering the question. The world model interactively generates new observations, while a VLM constructs the evidence buffer for Q&A and guides the search during the process.
  • Figure 3: Trajectory Expansion Illustration. The figure illustrates a trajectory expansion process with $k = 3$, $d = 0.25$, and $\theta=10^\circ$. In this case, the world model generates 9 new observations given the Beam Node.
  • Figure 4: Inference Steps vs. Accuracy. Accuracy on SAT-Real and SAT-Synthesized with different VLM Thresholds and inference steps.
  • Figure 5: Trajectory Expansion example on SAT-Real.
  • ...and 9 more figures
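
The count of 9 in the Figure 3 caption can be reproduced with a minimal sketch. It rests on an assumption that this excerpt does not state: that a beam node is expanded with three primitive actions (move forward by $d$, turn left by $\theta$, turn right by $\theta$), each repeated 1 to $k$ times. The names `Pose`, `step`, `expand`, and `ACTIONS` are hypothetical and are not taken from the authors' code. The sketch only enumerates candidate camera poses; the world model call that would render a view for each pose is omitted.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Planar camera pose. Yaw is counterclockwise from the +y axis, in degrees."""
    x: float = 0.0
    y: float = 0.0
    yaw_deg: float = 0.0


# Assumed action set (hypothetical): not stated in the excerpt above.
ACTIONS = ("forward", "turn_left", "turn_right")


def step(pose: Pose, action: str, d: float, theta_deg: float) -> Pose:
    """Apply one primitive action to a pose."""
    if action == "forward":
        yaw = math.radians(pose.yaw_deg)
        # Heading (-sin(yaw), cos(yaw)): yaw = 0 faces +y, positive yaw turns left.
        return Pose(pose.x - d * math.sin(yaw), pose.y + d * math.cos(yaw), pose.yaw_deg)
    if action == "turn_left":
        return Pose(pose.x, pose.y, pose.yaw_deg + theta_deg)
    if action == "turn_right":
        return Pose(pose.x, pose.y, pose.yaw_deg - theta_deg)
    raise ValueError(f"unknown action: {action}")


def expand(node: Pose, k: int = 3, d: float = 0.25, theta_deg: float = 10.0):
    """Enumerate candidate poses: each action repeated 1..k times from the node."""
    candidates = []
    for action in ACTIONS:
        pose = node
        for repeats in range(1, k + 1):
            pose = step(pose, action, d, theta_deg)
            candidates.append((action, repeats, pose))
    return candidates


candidates = expand(Pose())
print(len(candidates))
```

Under these assumptions the number of candidates is `len(ACTIONS) * k`, so the Figure 3 settings give 3 × 3 = 9 poses, each of which would be passed to the world model to generate one new observation.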