V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, Yang You

TL;DR

V-ReasonBench is a unified, reasoning-centric benchmark for evaluating video generation models under the Chain-of-Frames paradigm. It organizes reasoning into four dimensions (Structured Problem-Solving, Spatial Cognition, Pattern-based Inference, and Physical Dynamics) and scores each model on the last frame of its generated video, using mask-based, grid-based, and lightweight VLM-based judgments to enable scalable pass@k scoring. The dataset comprises 326 reasoning instances (652 images) and about 9,780 generated videos from six state-of-the-art models. The evaluation reveals dimension-specific strengths and distinct failure modes, and the automatic scores align closely with human judgments (about 97%). The study finds that temporal modeling benefits dynamic and physical tasks but can induce visual hallucinations and process-level deviations, which underscores the need for structure-preserving synthesis and for future work on closing the reasoning gaps in video generation.
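
To make the pass@k scoring mentioned above concrete, the following is a minimal sketch, not the benchmark's actual code. It assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n videos are generated for an instance and c of them pass the last-frame check. This summary does not state which estimator the authors use. The names `pass_at_k` and `benchmark_pass_at_k` and the example results are hypothetical. As a side note, 326 instances × 6 models × 5 samples equals 9,780, which matches the reported video count, but the five-samples-per-instance reading is an inference rather than something stated here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one instance.

    n: number of generated videos for the instance
    c: number of those videos whose last frame was judged correct
    k: sample budget (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: dict[str, list[bool]], k: int) -> float:
    """Average pass@k over instances.

    results maps a (hypothetical) instance id to per-video pass/fail flags.
    """
    if not results:
        raise ValueError("results must not be empty")
    scores = [pass_at_k(len(flags), sum(flags), k) for flags in results.values()]
    return sum(scores) / len(scores)


# Hypothetical pass/fail flags for three instances, five videos each.
example = {
    "instance_a": [True, False, False, False, False],
    "instance_b": [False, False, False, False, False],
    "instance_c": [True, True, True, True, True],
}
print(benchmark_pass_at_k(example, k=1))
print(benchmark_pass_at_k(example, k=5))
```

With k equal to the number of generated videos, the estimate for an instance reduces to "did at least one video pass", which is why binary last-frame judgments are enough to drive the metric.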

Abstract

Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.

Paper Structure

This paper contains 40 sections, 29 figures, 3 tables.

Figures (29)

  • Figure 1: Evaluation of Video Generation Models on V-ReasonBench. The performance of six video generation models across the four core reasoning dimensions is illustrated. Detailed numerical results are provided in Tab. \ref{tab:model_dimension}.
  • Figure 2: Overview of V-ReasonBench pipeline. The benchmark covers four reasoning dimensions, integrates both synthetic and real-world scenarios, and supports reproducible, large-scale evaluation of video reasoning capabilities.
  • Figure 3: Example failure case from the Sequence Completion task illustrating the limitations of VLM-based automatic evaluation. Although the underlying rule is simple, the VLM incorrectly assesses the model’s output due to difficulties in recognizing small grid cells and fine structural differences. More examples are given in Appendix \ref{appendix:vlm}.
  • Figure 4: Human-alignment validation of our benchmark’s scoring pipeline. Each point compares binary Pass/Unpass decisions from the automatic evaluation with human judgments across four reasoning categories.
  • Figure 5: Example from the Seedance-1.0-Lite model on the horizontal visual symmetry task. The model introduces additional decorative patterns across the mirrored axis, illustrating its tendency to enrich visual appearance rather than preserve original geometric form.
  • ...and 24 more figures
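
Figure 4 compares binary pass/fail decisions from the automatic pipeline with human judgments. The snippet below is a minimal sketch of how such an agreement rate could be computed, assuming simple per-item agreement (the fraction of items on which both decisions match). The function name `agreement_rate` and the example decisions are hypothetical, and the paper's exact validation protocol is not described in this summary.

```python
def agreement_rate(auto: list[bool], human: list[bool]) -> float:
    """Fraction of items where the automatic and human pass/fail decisions match."""
    if len(auto) != len(human):
        raise ValueError("decision lists must have the same length")
    if not auto:
        raise ValueError("decision lists must not be empty")
    matches = sum(a == h for a, h in zip(auto, human))
    return matches / len(auto)


# Hypothetical decisions for four videos (True = pass).
auto_decisions = [True, False, True, True]
human_decisions = [True, False, False, True]
print(agreement_rate(auto_decisions, human_decisions))  # 0.75
```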