WorldModelBench: Judging Video Generation Models As World Models

Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E. Gonzalez, Ion Stoica, Song Han, Yao Lu

TL;DR

WorldModelBench is a benchmark that evaluates video generation models as world models, focusing on instruction following and physics adherence. It covers 7 application domains and 56 subdomains (350 prompts) and uses 67K crowd-sourced human labels to score instruction following, physics adherence, and commonsense. The authors also fine-tune a 2B-parameter judger to automate the assessment. The results show that current models fall well short of ideal world-model behavior, that fine-tuning against rewards from the judger improves world modeling, and that existing benchmarks overlook physics violations. WorldModelBench therefore supplies both fine-grained evaluation data and a practical way to steer future video generation models toward reliable world modeling for decision-making tasks.

Abstract

Video generation models have progressed rapidly, positioning themselves as video world models capable of supporting decision-making applications such as robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims: they focus only on general video quality and ignore factors that matter for world models, such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages. (1) Sensitive to nuanced world modeling violations: by incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the law of conservation of mass, which prior benchmarks overlook. (2) Aligned with large-scale human preferences: we crowd-source 67K human labels to accurately measure 14 frontier models. Using these high-quality human labels, we further fine-tune a 2B-parameter judger to automate the evaluation procedure; it achieves 8.6% higher average accuracy than GPT-4o in predicting world modeling violations. In addition, we demonstrate that training to align with human annotations by maximizing the rewards from the judger noticeably improves world modeling capability. The website is available at https://worldmodelbench-team.github.io.
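The abstract compares the judger with GPT-4o by "average accuracy in predicting world modeling violations". A minimal sketch of how such a metric could be computed is given below, assuming that each video carries a binary violation label per dimension and that accuracy is averaged uniformly over dimensions. The function names, the label format, and the averaging scheme are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Mapping, Sequence, Tuple


def violation_accuracy(predicted: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of videos where the predicted violation flag matches the human label.

    Hypothetical helper: assumes one binary violation label per video.
    """
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    if len(human) == 0:
        raise ValueError("need at least one label")
    matches = sum(p == h for p, h in zip(predicted, human))
    return matches / len(human)


def average_accuracy(
    per_dimension: Mapping[str, Tuple[Sequence[bool], Sequence[bool]]]
) -> float:
    """Unweighted mean of per-dimension accuracies (an assumed averaging scheme)."""
    if len(per_dimension) == 0:
        raise ValueError("need at least one dimension")
    accuracies = [
        violation_accuracy(predicted, human)
        for predicted, human in per_dimension.values()
    ]
    return sum(accuracies) / len(accuracies)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not benchmark data.
    labels = {
        "physics_adherence": ([True, False], [True, True]),
        "instruction_following": ([True], [True]),
    }
    print(average_accuracy(labels))
```

Under this reading, "8.6% higher average accuracy" would correspond to the difference between two such averages, one computed for the judger and one for GPT-4o, against the same human labels.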

Paper Structure

This paper contains 31 sections, 2 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Models A and B both generate high-quality videos, but the robotic arm in A's video floats in the air, violating gravity. Established benchmarks focus on general video quality assessment and do not distinguish videos that violate physical laws.
  • Figure 2: Overview of WorldModelBench. WorldModelBench judges the world modeling capability of video generation models across diverse application-driven domains. On WorldModelBench, a model generates a video based on text and, optionally, image conditions, and is scored along the commonsense, instruction-following, and physics-adherence dimensions. We collect 67K human labels to evaluate 14 frontier models. WorldModelBench is paired with a fine-tuned judger that provides fine-grained feedback for future models, and training to align with its rewards improves world modeling capabilities.
  • Figure 3: WorldModelBench consists of 7 domains and 56 subdomains, totaling 350 image and text conditions.
  • Figure 4: Examples of violations across physics categories.
  • Figure 5: We enhance video generation models by leveraging sparse rewards from our fine-tuned judger. Solid arrows indicate the forward process, while dashed lines are gradient directions.
  • ...and 6 more figures