PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

Zeqing Wang, Keze Wang, Lei Zhang

TL;DR

This work identifies a gap in evaluating physical plausibility in Text-to-Video (T2V) models and shows that Vision-Language Models (VLMs) harbor latent physical reasoning that can be unlocked with targeted fine-tuning. It introduces the PID dataset and PhyDetEx, a LoRA-based detector and explainer that both detects physically implausible events and provides textual justifications. In experiments on PID and Impossible Videos, PhyDetEx achieves state-of-the-art performance and serves as a benchmark for assessing modern T2V models, revealing that open-source systems still struggle with basic physical laws while some closed-source models approach plausibility. Additionally, the framework enables physics-aware direct preference optimization (DPO) to further improve T2V outputs in terms of physical realism and commonsense reasoning.

Abstract

Driven by growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains an open question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify physically impossible content in generated videos. To investigate this issue, we construct a PID (Physical Implausibility Detection) dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models to produce physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs not only to detect physically implausible events but also to generate textual explanations of the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely PhyDetEx, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at https://github.com/Zeqing-Wang/PhyDetEx.
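
To make the paired structure of the train split concrete, the following minimal Python sketch shows one way such a pair might be represented and flattened into labeled detection examples. The class name `VideoPair`, its fields, the file names, the captions, and the presence of an explanation string in each pair are all illustrative assumptions, not the released PID schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoPair:
    """Hypothetical record for one train-split pair (not the released PID schema)."""
    real_video: str          # path to the real-world video
    real_caption: str        # original caption of the real-world video
    implausible_video: str   # T2V output generated from the rewritten caption
    rewritten_caption: str   # caption rewritten to describe an implausible event
    explanation: str         # assumed text naming the violated physical principle


def to_detection_examples(pair: VideoPair) -> list[dict]:
    """Flatten one pair into one plausible and one implausible example."""
    return [
        {"video": pair.real_video, "plausible": True, "explanation": ""},
        {
            "video": pair.implausible_video,
            "plausible": False,
            "explanation": pair.explanation,
        },
    ]


# Made-up example values, for illustration only.
pair = VideoPair(
    real_video="real_0001.mp4",
    real_caption="A ball rolls off a table and falls to the floor.",
    implausible_video="gen_0001.mp4",
    rewritten_caption="A ball rolls off a table and hovers in the air.",
    explanation="The ball does not fall, which violates gravity.",
)
examples = to_detection_examples(pair)
```

Because both examples in a pair share the same scene context, a detector trained on them cannot rely on scene content alone and has to attend to the event itself.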

Paper Structure

This paper contains 21 sections, 7 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Illustration of the physical implausibility detection task. Given a video where water is poured into a bottle, but the water level remains static and the cup passes through the bottle, humans can easily identify the violation of physical laws. However, current powerful VLMs (e.g., QwenVL and InternVL) incorrectly judge the motion as physically plausible, highlighting their difficulty in recognising implausible dynamics that are trivial for humans.
  • Figure 2: Results of preliminary experiments on ImpossibleVideos under three prompting conditions (C1–C3) based on InternVL2.5 26B. As the prompt provides progressively stronger hints that the video may be generated by an AIGC model, the VLM achieves notably higher accuracy in detecting physically implausible videos and generates more accurate reasoning (higher reasoning scores). However, its accuracy on physically plausible videos decreases accordingly. These trends indicate that VLMs possess an implicit understanding of physical plausibility, yet their judgments are strongly biased by the type of the input.
  • Figure 3: Overview of the construction pipeline of the PID dataset and the training process of PhyDetEx. (a) The PID training split. The training split includes 2,588 paired videos, where each implausible video is generated by rewriting the caption of a real-world video to describe an implausible event while keeping other content unchanged. (b) The PID test split. The test set consists of physically implausible and physically plausible videos. The implausible subset is collected from videos generated by multiple T2V models based on physically plausible prompts, where human annotators identify implausible events and provide textual explanations. The plausible subset combines generated and real-world videos verified to contain no physical violations, thereby eliminating the shortcut of distinguishing between generated and real videos. (c) Training PhyDetEx. Using the PID training split, we fine-tune the base VLM via LoRA adaptation to distinguish between physically plausible and implausible events within the same video contexts. The resulting model, PhyDetEx, achieves substantial improvements in detecting physically implausible content.
  • Figure 4: Qualitative comparison between our PhyDetEx and recent VLMs on detecting physical implausibility. We illustrate two representative cases from our PID test split. In the first case (top), the dolphin remains floating above the sea surface without descending, violating gravity. LLaVA-OneVision and MiniCPM-V misinterpret the scene as a normal leap. Our PhyDetEx correctly identifies the physical implausibility and attributes it to the lack of gravitational influence. In the second case (bottom), a woman jumps from the woods toward the river but remains suspended midair. Other VLMs again regard this as plausible or irrelevant to physics, whereas PhyDetEx identifies the implausible subject and provides the correct reasoning that her motion defies gravity and buoyancy.
  • Figure S1: Samples in our PID test split.
  • ...and 3 more figures
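
The trade-off described in the Figure 2 caption (higher accuracy on implausible videos, lower accuracy on plausible ones) is only visible when accuracy is reported per class. The sketch below is a minimal illustration of such a per-class computation; the function name `per_class_accuracy` and the toy labels are assumptions for illustration, not the paper's evaluation code or data.

```python
def per_class_accuracy(labels, predictions):
    """Compute accuracy separately for plausible and implausible videos.

    labels, predictions: sequences of bools, where True means
    "physically plausible" and False means "physically implausible".
    """
    totals = {True: 0, False: 0}
    correct = {True: 0, False: 0}
    for truth, pred in zip(labels, predictions):
        totals[truth] += 1
        correct[truth] += int(truth == pred)
    return {
        "plausible_acc": correct[True] / totals[True] if totals[True] else None,
        "implausible_acc": correct[False] / totals[False] if totals[False] else None,
    }


# Toy data: a detector biased toward answering "implausible".
labels = [True, True, False, False]
predictions = [True, False, False, False]
result = per_class_accuracy(labels, predictions)
print(result)  # {'plausible_acc': 0.5, 'implausible_acc': 1.0}
```

In this toy case overall accuracy is 0.75, which hides the fact that half of the plausible videos are misjudged; reporting the two numbers separately exposes the bias.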