When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal

TL;DR

This work addresses the unreliability of visual spatial reasoning when static observations are insufficient by enabling adaptive test-time imagination. It introduces AVIC, a gating-policy and trajectory-verification framework that invokes and scales world-model imagination only when it is likely to be informative, and that plans targeted imagined viewpoints aligned with hypothesized actions. An analysis of always-on imagination and experiments on SAT-Real, MMSI-Bench, and R2R show that a small, well-timed amount of imagination yields significant gains while greatly reducing computational cost, with the largest benefits for action-conditioned spatial reasoning. The findings indicate that test-time imagination should be instance-dependent and uncertainty-aware to achieve efficient and reliable visual spatial reasoning in multimodal systems and embodied tasks.

Abstract

Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.

Paper Structure

This paper contains 26 sections, 6 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Different cases in always-on visual imagination. Imagined views are generated independently for different beam-searched actions (shown by multiple arrows). Case 1 (Helpful): Visual imagination reveals previously unseen viewpoints, enabling helpful spatial reasoning. Case 2 (Misleading): Imagination fails to preserve task-relevant objects (e.g., the white table in the red box), resulting in incorrect spatial inference and wrong answers. Case 3 (Unnecessary): The required information is already clearly observable in the original view (e.g., the bathtub in the blue box), making additional imagined views redundant.
  • Figure 2: (a): In the majority of cases, visual imagination is unnecessary, while a smaller fraction is helpful or misleading, highlighting the need for selective invocation rather than uniform use. (b): Accuracy gain over the baseline as a function of the number of imagined views. Performance improvements are non-monotonic, indicating that additional imagination does not consistently translate to better reasoning and may even degrade accuracy when too many views are generated. (c): Accuracy versus average token usage. Bubble size indicates average running time. Fixed imagination strategies achieve higher accuracy at the cost of substantially increased computation, motivating adaptive test-time scaling that balances performance and efficiency.
  • Figure 3: Comparison with other methods. (a) Answers directly from the current observation without any imagination. (b) Always invokes the world model with full exploration to generate imagined views for downstream reasoning. (c) Ours: Uses a policy model to first decide whether visual imagination is necessary and to plan actions accordingly. It selectively queries the world model (deciding both when and how much) and otherwise performs direct reasoning.
  • Figure 4: Analysis of when and how much to invoke world-model imagination.
  • Figure 5: Qualitative examples on SAT comparing the always-on imagination method with our adaptive method, together with an example from the R2R navigation task. In the navigation example, the green option is the one selected by the model with adaptive imagination via our method, while the red option is the one selected without world-model imagination.
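
The control flow described for Figure 3(c) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `GateDecision`, `adaptive_answer`, and `max_imagined_views`, as well as the toy policy, world model, and reasoner, are all hypothetical stand-ins, and the trajectory-verification step mentioned in the TL;DR is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class GateDecision:
    """Hypothetical output of the policy model: whether to imagine, and which actions."""
    needs_imagination: bool
    planned_actions: List[str]


def adaptive_answer(
    question: str,
    views: Sequence[str],
    policy: Callable[[str, Sequence[str]], GateDecision],
    world_model: Callable[[Sequence[str], str], str],
    reasoner: Callable[[str, Sequence[str]], str],
    max_imagined_views: int = 2,  # assumed budget on world-model calls
) -> Tuple[str, int]:
    """Return (answer, number of world-model calls made)."""
    decision = policy(question, views)
    if not decision.needs_imagination:
        # Static evidence judged sufficient: reason directly, no world-model calls.
        return reasoner(question, views), 0
    # Otherwise imagine one view per planned action, capped by the budget.
    actions = decision.planned_actions[:max_imagined_views]
    imagined = [world_model(views, action) for action in actions]
    return reasoner(question, list(views) + imagined), len(imagined)


# Toy stand-ins (assumptions for illustration only).
def toy_policy(question: str, views: Sequence[str]) -> GateDecision:
    # Treat action-conditioned questions as the ones that need imagination.
    if "if i" in question.lower():
        return GateDecision(True, ["turn_left", "move_forward", "turn_right"])
    return GateDecision(False, [])


def toy_world_model(views: Sequence[str], action: str) -> str:
    return f"imagined({action})"


def toy_reasoner(question: str, views: Sequence[str]) -> str:
    return f"answer from {len(views)} views"


if __name__ == "__main__":
    print(adaptive_answer("Is the tub left of the sink?", ["img0"],
                          toy_policy, toy_world_model, toy_reasoner))
    print(adaptive_answer("If I turn left, what is ahead?", ["img0"],
                          toy_policy, toy_world_model, toy_reasoner))
```

In this sketch the gate controls "when" (the early return with zero world-model calls) and the cap on planned actions controls "how much"; a real system would replace the toy callables with an MLLM policy, a generative world model, and an MLLM reasoner.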