"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models
Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang
TL;DR
PhyWorldBench is a large-scale, physics-centered benchmark for text-to-video models. It tests adherence to real-world physics with 1,050 carefully annotated prompts and 12,600 generated videos across 12 models. It features a three-stage data creation pipeline, a Yes/No evaluation framework that judges each video on semantic adherence and physical commonsense, and an automated MLLM-based evaluator (CAP) for zero-shot physics assessment. The study reveals persistent gaps in temporal consistency and physical realism, shows that physics-aware prompting modestly improves results, and highlights a risky tendency of higher-quality models to rationalize physics violations rather than enforce physical laws. By providing a comprehensive benchmark, CAP-based evaluation, and prompt-design guidance, PhyWorldBench aims to support the development of physically faithful, safe, and practically useful video-generation systems.
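To make the Yes/No evaluation concrete, the following is a minimal sketch of how per-video judgments on the two criteria could be aggregated into per-model success rates. The `Judgment` record, the `success_rates` helper, and the choice to report a combined "both criteria" rate are illustrative assumptions, not the benchmark's actual code or its official metric definition. Only the counts (1,050 prompts, 12 models, 12,600 videos) come from the text above.

```python
from collections import defaultdict
from dataclasses import dataclass

# Counts stated in the TL;DR: one video per (prompt, model) pair.
NUM_PROMPTS = 1050
NUM_MODELS = 12
NUM_VIDEOS = NUM_PROMPTS * NUM_MODELS  # 12,600 videos in total


@dataclass(frozen=True)
class Judgment:
    """One Yes/No judgment for a single generated video (hypothetical record)."""
    model: str
    prompt_id: int
    semantic_adherence: bool    # Yes/No: does the video depict what the prompt asks for?
    physical_commonsense: bool  # Yes/No: does the depicted motion obey the expected physics?


def success_rates(judgments):
    """Return {model: (sa_rate, pc_rate, both_rate)}.

    Counting a video as fully successful only when BOTH answers are Yes
    is an assumption made for this sketch.
    """
    counts = defaultdict(lambda: [0, 0, 0, 0])  # total, SA yes, PC yes, both yes
    for j in judgments:
        c = counts[j.model]
        c[0] += 1
        c[1] += j.semantic_adherence
        c[2] += j.physical_commonsense
        c[3] += j.semantic_adherence and j.physical_commonsense
    return {
        model: (sa / total, pc / total, both / total)
        for model, (total, sa, pc, both) in counts.items()
    }


if __name__ == "__main__":
    # Toy judgments for two made-up models; not real benchmark results.
    toy = [
        Judgment("model_a", 1, True, True),
        Judgment("model_a", 2, True, False),
        Judgment("model_b", 1, False, False),
        Judgment("model_b", 2, True, True),
    ]
    for model, rates in success_rates(toy).items():
        print(model, rates)
```

Keeping the two criteria separate before combining them makes it possible to tell a video that ignores the prompt apart from one that follows the prompt but violates physics.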
Abstract
Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles such as object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel Anti-Physics category, in which prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. In addition to large-scale human evaluation, we design a simple yet effective method that uses current multimodal large language models (MLLMs) to evaluate physical realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with detailed comparison and analysis. Through systematic testing across 1,050 curated prompts spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We further examine their performance across diverse physical phenomena and prompt types, and derive targeted recommendations for crafting prompts that improve fidelity to physical principles.
