TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen

TL;DR

TiViBench is a hierarchical, four-dimension benchmark that quantifies reasoning in image-to-video (I2V) generation across 24 tasks and three difficulty levels, addressing a gap in evaluating higher-order visual reasoning. The paper pairs the benchmark with VideoTPO, a test-time strategy inspired by preference optimization that uses a language model's self-analysis of generated candidates to refine prompts without retraining. The study finds that commercial I2V systems exhibit stronger, more consistent reasoning, while open-source systems show latent potential that improves with scale and data; VideoTPO reliably boosts reasoning across tasks. Together, TiViBench and VideoTPO lay a foundation for evaluating and driving progress in reasoning for video generation models.

Abstract

The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
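The abstract describes VideoTPO only at a high level: generate candidates, have an LLM analyze their strengths and weaknesses, and use that analysis at test time without any training. The following is a minimal sketch of one way such a loop could be organized, not the authors' implementation. Every name here (`Critique`, `refine_prompt_at_test_time`, the `generate`, `analyze`, and `rewrite` callables, and the default candidate and round counts) is hypothetical, and the toy stubs stand in for a real I2V model and a real LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    """Hypothetical container for the LLM's self-analysis of the candidates."""
    preferred: int  # index of the candidate judged stronger
    rejected: int   # index of the candidate judged weaker
    feedback: str   # textual summary of strengths and weaknesses


def refine_prompt_at_test_time(
    prompt: str,
    generate: Callable[[str], str],
    analyze: Callable[[str, List[str]], Critique],
    rewrite: Callable[[str, Critique], str],
    num_candidates: int = 2,  # assumed value, not taken from the paper
    num_rounds: int = 1,      # assumed value, not taken from the paper
) -> Tuple[str, str]:
    """Refine a prompt from candidate critiques; no model weights are updated.

    Returns the refined prompt and a video generated from it.
    """
    for _ in range(num_rounds):
        # Sample several candidate videos from the current prompt.
        candidates = [generate(prompt) for _ in range(num_candidates)]
        # Ask the analyzer to compare candidates and describe what went wrong.
        critique = analyze(prompt, candidates)
        # Fold the textual feedback back into the prompt.
        prompt = rewrite(prompt, critique)
    return prompt, generate(prompt)


# Toy stand-ins so the sketch runs without any model (purely illustrative).
def toy_generate(prompt: str) -> str:
    return f"video<{prompt}>"


def toy_analyze(prompt: str, candidates: List[str]) -> Critique:
    return Critique(preferred=0, rejected=1, feedback="state the goal cell explicitly")


def toy_rewrite(prompt: str, critique: Critique) -> str:
    return f"{prompt} ({critique.feedback})"
```

In this arrangement the only thing that changes between rounds is the prompt text, which matches the abstract's claim that no additional training, data, or reward model is needed; how the paper actually structures the analysis and the prompt update may differ.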

Paper Structure

This paper contains 66 sections, 3 equations, 22 figures, 3 tables.

Figures (22)

  • Figure 1: Pass@$1$ performance overview on TiViBench across $24$ tasks within $4$ dimensions.
  • Figure 2: (Left) Language models have evolved from basic understanding tasks to advanced reasoning capabilities. (Middle) Can video generative models exhibit reasoning capabilities comparable to those of LLMs? (Right) Existing I2V benchmarks focus on general generation capabilities (e.g., spatial fidelity, temporal smoothness), while our TiViBench complements these by introducing a reasoning-oriented benchmark, enabling comprehensive evaluation across both general and reasoning abilities.
  • Figure 3: Overview of TiViBench. TiViBench represents an image-to-video (I2V) benchmark tailored to comprehensively evaluate the emerging visual reasoning capabilities across four key categories: (1st) Structural Reasoning & Search, (2nd) Spatial & Visual Pattern Reasoning, (3rd) Symbolic & Logical Reasoning, and (4th) Action Planning & Task Execution. Each category encompasses six diverse tasks to challenge video generative models to perform complex reasoning beyond general generation.
  • Figure 4: Overview of our proposed (Left) TiViBench benchmark and (Right) VideoTPO framework.
  • Figure 5: Overview of TiViBench's statistical distributions. (Left) Word distribution of prompt suites; (Middle) Data distribution across $24$ tasks; and (Right) Data distribution across $3$ difficulty levels.
  • ...and 17 more figures
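Figure 1 reports Pass@1 across the four reasoning dimensions. As a rough illustration of how such a per-dimension score could be aggregated, the sketch below treats Pass@1 as the fraction of samples whose single generated video is judged correct. That reading, the function name `pass_at_1`, and the `(dimension, passed)` input format are assumptions for illustration; only the four dimension names come from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# The four reasoning dimensions named in the abstract.
DIMENSIONS = (
    "Structural Reasoning & Search",
    "Spatial & Visual Pattern Reasoning",
    "Symbolic & Logical Reasoning",
    "Action Planning & Task Execution",
)


def pass_at_1(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate per-sample pass/fail judgments into a per-dimension rate.

    `results` holds one (dimension, passed) pair per benchmark sample, where
    `passed` says whether the single generated video was judged correct.
    """
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for dimension, ok in results:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension!r}")
        totals[dimension] += 1
        passed[dimension] += int(ok)
    # Report only dimensions that actually have samples.
    return {d: passed[d] / totals[d] for d in DIMENSIONS if totals[d]}
```

The same aggregation could be keyed by task or by difficulty level instead of by dimension; the judgments themselves would come from whatever evaluation protocol the benchmark defines.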