"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang

TL;DR

PhyWorldBench introduces a large-scale, physics-centered benchmark for text-to-video models, probing adherence to real-world physics through 1,050 carefully annotated prompts and 12,600 generated videos across 12 models. It features a three-stage data creation pipeline, a Yes/No evaluation framework with semantic adherence and physical commonsense, and an automated MLLM-based evaluator (CAP) for zero-shot physics assessment. The study reveals persistent gaps in temporal consistency and physical realism, shows that physics-aware prompting modestly improves results, and highlights a risky tendency of higher-quality models to rationalize physics violations rather than actively enforce physical laws. By providing a comprehensive benchmark, CAP-based evaluation, and prompt-design guidance, PhyWorldBench aims to elevate the development of physically faithful, safe, and practically useful video-generation systems.

Abstract

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles such as object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel Anti-Physics category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that utilizes current multimodal large language models to evaluate physics realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with detailed comparison and analysis. Through systematic testing across 1050 curated prompts spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We further examine their performance under diverse physical phenomena and prompt types, and derive targeted recommendations for crafting prompts that enhance fidelity to physical principles.

Paper Structure

This paper contains 37 sections, 21 figures, 17 tables.

Figures (21)

  • Figure 1: Overview of PhyWorldBench. The benchmark follows a structured design, starting with 10 main physics categories, derived from physics literature and expert consultations. Each category is divided into 5 subcategories, capturing different aspects. Under each subcategory, 7 scenarios are created, with 3 prompt variations per scenario to provide varying levels of detail and complexity. The figure presents the benchmark structure, showcasing the 10 main categories and their corresponding 5 subcategories.
  • Figure 2: Success rates of video generation models on PhyWorldBench. Among open-source models, Wanx demonstrated the highest performance, while Pika achieved the best results among proprietary models with a success rate of 0.262. Despite these advancements, substantial progress remains necessary to refine the capability of these models to accurately simulate the intricate dynamics of the real world.
  • Figure 3: Creation Process of PhyWorldBench. The dataset is built through a three-stage pipeline for clarity, consistency, and completeness. First, physics categories and prompts are defined using literature and expert input. Next, GPT-4o and Gemini-1.5-Pro, together with humans, refine the prompts for diversity and accuracy. Finally, a curation phase standardizes all prompts, with human-in-the-loop reviews ensuring clarity and eliminating ambiguities.
  • Figure 4: Example of Three Prompt Types for a Scenario. This figure illustrates a scenario from the subcategory Linear Motion under the main category Object Motion and Kinematics. The scenario is presented through three levels of prompts: (1) Event Prompt, providing a concise and straightforward event description; (2) Physics-Enhanced Prompt, which builds on the general prompt by incorporating physics-related phenomena while avoiding explicit physical laws; and (3) Detailed Narrative Prompt, enriching the Event Prompt with vivid details and contextual elements.
  • Figure 5: Illustration of Our Evaluation Metric and Human Annotations. We demonstrate our evaluation process for assessing the quality of generated videos based on two evaluation criteria: Basic Standards and Key Standards. For Basic Standards, we verify whether the generated video contains the correct number of objects and accurately represents the intended action or event. For Key Standards, we define specific physical phenomena as ground truth and check whether the generated video satisfies all of them. Each criterion yields a score of either "0" or "1" for a generated video. Red circles and yellow lines in the figure highlight instances where the generated videos fail to meet the Key Standards.
  • ...and 16 more figures
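
The counts in the Figure 1 caption and the binary scoring in the Figure 5 caption can be checked with a short sketch. The hierarchy sizes (10 categories, 5 subcategories, 7 scenarios, 3 prompts) and the 12 models come from the text above. The names `VideoAnnotation`, `is_success`, and `success_rate` are hypothetical, and the rule that a video succeeds only when both Basic Standards and Key Standards score 1 is an assumption made for illustration, not a rule stated in the captions.

```python
from dataclasses import dataclass

# Hierarchy sizes as stated in the Figure 1 caption.
NUM_CATEGORIES = 10
SUBCATEGORIES_PER_CATEGORY = 5
SCENARIOS_PER_SUBCATEGORY = 7
PROMPTS_PER_SCENARIO = 3
NUM_MODELS = 12  # number of evaluated models, from the TL;DR


def benchmark_size():
    """Return (scenarios, prompts, videos) implied by the hierarchy."""
    scenarios = (NUM_CATEGORIES
                 * SUBCATEGORIES_PER_CATEGORY
                 * SCENARIOS_PER_SUBCATEGORY)
    prompts = scenarios * PROMPTS_PER_SCENARIO
    videos = prompts * NUM_MODELS  # one video per prompt per model
    return scenarios, prompts, videos


@dataclass
class VideoAnnotation:
    """Hypothetical record of the two Yes/No checks for one video."""
    basic_standards_met: bool  # correct object count and intended event
    key_standards_met: bool    # all listed physical phenomena satisfied


def is_success(annotation):
    # ASSUMPTION: a video counts as a success only if both checks pass.
    return annotation.basic_standards_met and annotation.key_standards_met


def success_rate(annotations):
    """Fraction of annotated videos that pass both checks."""
    if not annotations:
        return 0.0
    return sum(is_success(a) for a in annotations) / len(annotations)


if __name__ == "__main__":
    print(benchmark_size())
    demo = [VideoAnnotation(True, True), VideoAnnotation(True, False)]
    print(success_rate(demo))
```

The product 10 × 5 × 7 × 3 gives the 1,050 prompts, and 1,050 × 12 gives the 12,600 generated videos quoted in the TL;DR.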