Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

Xinxin Liu, Zhaopan Xu, Ming Li, Kai Wang, Yong Jae Lee, Yuzhang Shang

TL;DR

This work tackles the gap between symbolic reasoning and continuous world dynamics by introducing Gen-ViRe, a Generative Visual Reasoning Benchmark that evaluates Chain-of-Frames (CoF) reasoning through a six-dimension taxonomy with 24 subtasks. It combines multi-source data curation, minimal prompting, and hybrid Vision-Language Model (VLM) assisted evaluation to quantify how well state-of-the-art video generation models perform as world simulators. Large-scale experiments on seven SOTA models reveal a consistent gap between impressive visual fidelity and genuine multi-step reasoning, and provide baselines and diagnostic tools to steer future development toward video generation that reasons and remains physically consistent. The benchmark offers a principled framework for diagnosing deficits in perception, planning, and abstract reasoning, with practical implications for embodied AI and autonomous systems.

Abstract

While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning -- materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications, which decomposes CoF reasoning into six cognitive dimensions -- from perceptual logic to abstract planning -- and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on SOTA systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.

Paper Structure

This paper contains 18 sections, 1 equation, 9 figures, 1 table.

Figures (9)

  • Figure 1: A comparison of reasoning approaches for a maze-solving task. Humans visualize the path via mental simulation. A Multimodal Large Language Model (MLLM) uses symbolic reasoning (CoT) to describe the path, e.g., via coordinates. In contrast, a Video Generation Model (VGM) uses generative visual reasoning (CoF) to physically simulate the process, generating frames of the square moving from start to finish.
  • Figure 2: Gen-ViRe evaluates six core cognitive dimensions: (1) Perceptual, (2) Analogical, (3) Abstract, (4) Planning, (5) Spatial & Temporal, and (6) Algorithmic & Logical, with each dimension comprising four subtasks.
  • Figure 3: Qualitative examples of Gen-ViRe tasks. The figure illustrates sample inputs and their expected Chain-of-Frames (CoF) visual reasoning outputs across the six cognitive dimensions, highlighting the benchmark's breadth from foundational perception to high-order planning.
  • Figure 4: The evaluation framework of Gen-ViRe. (a) Data Curation: Shows the benchmark development process, including defining the taxonomy, collecting data from multiple channels (web, existing datasets, AI generation), and designing & validating prompts through Peer Review. (b) Formulation of Evaluation Criteria: Demonstrates the process of formulating detailed, multi-dimensional evaluation criteria (as shown by C1-C5 in the figure) for each prompt of every subtask. (c) VLM-based Autorating Framework: Illustrates how the VLM (Autorater) conducts item-by-item analysis and automatic scoring of the generated videos based on the specific criteria defined in (b).
  • Figure 5: Left: The main chart compares the overall performance of the 7 state-of-the-art models across the six core cognitive dimensions (Abstract, Algorithmic, Analogy, Perceptual, Planning, and Spatial Reasoning). Right: The six sub-charts provide a detailed performance breakdown for the individual subtasks within each dimension. The legend (bottom) links each colored line to its respective model.
  • ...and 4 more figures
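
The Figure 4 and Figure 5 captions describe per-criterion scoring of each generated video, reported per subtask and per cognitive dimension. The summary above does not state how criterion scores are combined, so the following is only a minimal sketch under assumptions: each criterion score lies in [0, 1], and plain unweighted averaging is used at every level. The function name `aggregate_scores`, the record layout, and the subtask names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(ratings):
    """Roll per-criterion autorater scores up to subtask and dimension level.

    `ratings` is a list of dicts, one per generated video, with keys
    "dimension", "subtask", and "criteria" (a list of scores in [0, 1]).
    Unweighted averaging is an assumption, not the paper's stated method.
    """
    # Step 1: one score per video = mean of its criterion scores.
    per_subtask = defaultdict(list)
    for rating in ratings:
        video_score = mean(rating["criteria"])
        per_subtask[(rating["dimension"], rating["subtask"])].append(video_score)

    # Step 2: one score per subtask = mean over its videos.
    subtask_scores = {key: mean(scores) for key, scores in per_subtask.items()}

    # Step 3: one score per dimension = mean over its subtasks.
    per_dimension = defaultdict(list)
    for (dimension, _subtask), score in subtask_scores.items():
        per_dimension[dimension].append(score)
    dimension_scores = {dim: mean(scores) for dim, scores in per_dimension.items()}

    return subtask_scores, dimension_scores


# Hypothetical ratings; the subtask names are placeholders for illustration.
example_ratings = [
    {"dimension": "Planning", "subtask": "maze", "criteria": [1.0, 1.0, 0.0, 0.0]},
    {"dimension": "Planning", "subtask": "maze", "criteria": [1.0, 1.0, 1.0, 1.0]},
    {"dimension": "Planning", "subtask": "tool_use", "criteria": [0.0, 0.0]},
]
subtask_scores, dimension_scores = aggregate_scores(example_ratings)
```

Averaging subtasks before dimensions keeps a subtask with many prompts from dominating its dimension score; a weighted scheme would be an equally plausible choice.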