RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence

Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu, Senlin Cheng, Di Weng, Haifeng Liu, Can Ye, Boxi Wu

TL;DR

RULER-Bench introduces a rule-based reasoning benchmark for video generation, addressing the need to evaluate cognitive capabilities beyond perceptual quality. It formulates a six-category taxonomy (Vision, Science, Semantics, Hypothesis, Game, Humanity) across Nature, Society, and Virtuality, and provides two task paradigms, text-to-video (T2V) and image-to-video (I2V), with 40 tasks and 622 carefully curated instances. A checklist-based evaluation using multimodal LLMs (e.g., GPT-o3) achieves about 85% alignment with human judgments. Across 10 state-of-the-art models, results reveal substantial gaps in rule coherence, especially in I2V tasks, underscoring both the promise and the difficulty of advancing reasoning-aware video generation toward vision foundation intelligence.

Abstract

Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, such as visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms, text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgments. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insights obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.
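
The checklist scoring and the human alignment figure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The record layout (`metric`, `label`, `max_label`), the normalization of discrete labels to [0, 1], the placeholder metric name `other_metric`, and the definition of alignment as exact label agreement are all assumptions made for illustration.

```python
from collections import defaultdict


def metric_scores(checklist):
    """Average normalized checklist labels per metric.

    Each item is a dict with hypothetical keys:
      metric     name of the evaluation metric the question belongs to
      label      discrete score assigned by the evaluator
      max_label  highest label available for that question
    """
    per_metric = defaultdict(list)
    for item in checklist:
        per_metric[item["metric"]].append(item["label"] / item["max_label"])
    return {name: sum(vals) / len(vals) for name, vals in per_metric.items()}


def agreement_rate(evaluator_labels, human_labels):
    """Fraction of checklist questions where both label lists match exactly."""
    if len(evaluator_labels) != len(human_labels) or not evaluator_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(e == h for e, h in zip(evaluator_labels, human_labels))
    return matches / len(evaluator_labels)


# Toy data, invented for this example only.
checklist = [
    {"metric": "rule_coherence", "label": 1, "max_label": 2},
    {"metric": "rule_coherence", "label": 2, "max_label": 2},
    {"metric": "other_metric", "label": 2, "max_label": 2},
]
scores = metric_scores(checklist)
rate = agreement_rate([1, 0, 2, 1], [1, 0, 1, 1])
```

Under this reading, an 85% alignment would mean that the evaluator and the human annotators assign the same label to 85 of every 100 checklist questions; the paper may define alignment differently.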

Paper Structure

This paper contains 30 sections, 19 figures, and 8 tables.

Figures (19)

  • Figure 1: Overview of RULER-Bench. We propose RULER-Bench, a comprehensive benchmark designed to evaluate the rule-based reasoning abilities of video generation models. Left: Grounded in three fundamental domains, we formulate rule-based reasoning ability into six categories: Science, Vision, Hypothesis, Game, Semantics, and Humanity. These categories are further subdivided into 40 tasks. Center: Using the collected samples, we evaluate 10 video models based on the corresponding checklist across four metrics. Each checklist question is scored by GPT-o3 with discrete labels. To validate the reliability of the evaluator, we conduct a human alignment study, in which GPT-o3 achieves 85% agreement with human judgments. Right: Extensive experiments demonstrate that Veo3.1 achieves the best performance. However, all models exhibit limited reasoning ability across different rule categories.
  • Figure 2: Overview of dataset construction and validation. First, we formulate our tasks based on the six rule categories. Second, we design task-specific data construction pipelines for T2V and I2V tasks. Third, we leverage an MLLM to construct checklist questions across four evaluation metrics. Finally, we conduct quality control and data refinement for the constructed dataset and checklists.
  • Figure 3: Evaluation pipeline of RULER-Bench.
  • Figure 4: Average performance of video generation models across different tasks on RULER-Bench. Video models generally perform best on tasks in Humanity and Hypothesis, while showing lower performance on Vision and Game categories.
  • Figure 5: Case studies on three closed-source models across six rule categories. Each sample is provided with the Rule Coherence aspects derived from the checklist questions. The three video models exhibit varying performance across different instances.
  • ...and 14 more figures