Physion-Eval: Evaluating Physical Realism in Generated Video via Human Reasoning

Qin Zhang; Peiyu Jing; Hong-Xing Yu; Fangqiang Ding; Fan Nie; Weimin Wang; Yilun Du; James Zou; Jiajun Wu; Bing Shuai

Abstract

Video generation models are increasingly used as world simulators for storytelling, simulation, and embodied AI. As these models advance, a key question arises: do generated videos obey the physical laws of the real world? Existing evaluations largely rely on automated metrics or coarse human judgments such as preferences or rubric-based checks. While useful for assessing perceptual quality, these methods provide limited insight into when and why generated dynamics violate real-world physical constraints. We introduce Physion-Eval, a large-scale benchmark of expert human reasoning for diagnosing physical realism failures in videos generated by five state-of-the-art models across egocentric and exocentric views, containing 10,990 expert reasoning traces spanning 22 fine-grained physical categories. Each generated video is derived from a corresponding real-world reference video depicting a clear physical process, and annotated with temporally localized glitches, structured failure categories, and natural-language explanations of the violated physical behavior. Using this dataset, we reveal a striking limitation of current video generation models: in physics-critical scenarios, 83.3% of exocentric and 93.5% of egocentric generated videos exhibit at least one human-identifiable physical glitch. We hope Physion-Eval will set a new standard for physical realism evaluation and guide the development of physics-grounded video generation. The benchmark is publicly available at https://huggingface.co/datasets/PhysionLabs/Physion-Eval.

Paper Structure (22 sections, 2 equations, 12 figures, 5 tables)

Introduction
Related Work
Video Source Curation
Human Evaluation of Physical Realism
Perceptual Detection by Ordinary Viewers
Experiment Setup
Results
Physion-Eval: Physical Reasoning Benchmark
Annotation Protocol
Comparison of Human and MLLM Reasoning
Diagnosing Video Generation Models
Conclusion
Task Definition
Exocentric Video Curation
Evaluation Prompts
...and 7 more sections

Figures (12)

Figure 1: Physion-Eval Benchmark. (Left) The benchmark spans diverse physical phenomena across egocentric and exocentric views, evaluating videos generated by five state-of-the-art generation models. (Right, top) Physion-Eval provides 10,990 expert-annotated reasoning traces with timestamped glitch localization, structured failure categories, and natural-language explanations. (Right, bottom) Results reveal a large physical realism gap: 83.3% of exocentric and 93.5% of egocentric generated videos contain at least one human-identifiable physical glitch, motivating physics-grounded video generation and automated critics.
Figure 2: Examples of physical glitches in AI-generated videos from Physion-Eval. Each row shows a representative failure mode where generated dynamics violate basic physical principles. Frame sequences illustrate how these glitches emerge over time.
Figure 3: Two complementary human evaluation studies for assessing physical realism in generated videos. (a) Perceptual detection by ordinary viewers. Untrained viewers evaluate a blinded 1:1 mixture of real-world videos and outputs from five video generation models, judging whether each clip appears physically realistic. The evaluation metric measures how often generated videos are perceived as physically realistic relative to real videos. (b) Physion-Eval expert reasoning benchmark. Expert annotators follow a three-expert workflow to annotate generated videos, producing temporally localized failures, category labels, severity scores, and natural-language explanations. The final dataset contains 10,990 adjudicated reasoning annotations for diagnosing failure modes in video generation models.
Figure 4: Evaluation results of the untrained human study across video generation models under (a) exocentric and (b) egocentric settings. The radial plots (left) visualize Youden's J statistic ($J_G$) for each evaluator, while the tables (right) report the corresponding metrics $\pi_R$, $\pi_G$, and $J_G$. Across models, untrained human viewers consistently achieve higher $J_G$ scores than current MLLM critics, indicating a stronger sensitivity to physical glitches in generated videos, especially in the egocentric setting.
Figure 5: Example from Physion-Eval comparing expert human annotations and MLLM reasoning. In this example, human annotators correctly identify that the ice object sprays water without a visible cause and later increases in volume while melting, violating expected causal behavior and mass conservation. In contrast, Gemini 3.1 Pro hallucinates a non-existent shadow artifact, highlighting a substantial gap between human reasoning and current automated critics.
...and 7 more figures
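
The Figure 4 caption reports Youden's J statistic alongside $\pi_R$ and $\pi_G$ without defining them on this page. The sketch below shows one plausible reading, under the assumption that $\pi_R$ and $\pi_G$ are the fractions of real and generated videos that an evaluator judges physically realistic. With that reading, the standard form J = sensitivity + specificity - 1 reduces to $\pi_R - \pi_G$. The function name and input layout are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def youden_j(
    real_judged_realistic: Sequence[bool],
    generated_judged_realistic: Sequence[bool],
) -> Tuple[float, float, float]:
    """Return (pi_R, pi_G, J_G) for one evaluator.

    Assumption (not confirmed by the source): pi_R and pi_G are the
    fractions of real and generated clips judged physically realistic.
    Treating "flagged as unrealistic" as the positive class:
        sensitivity = 1 - pi_G   (generated clips correctly flagged)
        specificity = pi_R       (real clips correctly accepted)
        J = sensitivity + specificity - 1 = pi_R - pi_G
    """
    if not real_judged_realistic or not generated_judged_realistic:
        raise ValueError("both groups need at least one judgment")
    pi_r = sum(real_judged_realistic) / len(real_judged_realistic)
    pi_g = sum(generated_judged_realistic) / len(generated_judged_realistic)
    return pi_r, pi_g, pi_r - pi_g


# Toy judgments, invented purely for illustration.
real = [True, True, True, False]
generated = [True, False, False, False]
pi_r, pi_g, j_g = youden_j(real, generated)
```

Under this reading, a score near 1 means the evaluator separates real from generated clips almost perfectly, while a score near 0 means generated clips pass as realistic about as often as real ones.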
