PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

Zeqing Wang; Keze Wang; Lei Zhang

TL;DR

This work identifies a gap in evaluating the physical plausibility of Text-to-Video (T2V) models and shows that Vision-Language Models (VLMs) have latent physical reasoning ability that targeted fine-tuning can unlock. It introduces the PID dataset and PhyDetEx, a LoRA-based detector and explainer that detects physically implausible events and provides textual justifications for its decisions. In experiments on PID and Impossible Videos, PhyDetEx achieves state-of-the-art performance and serves as a benchmark for assessing modern T2V models. The benchmark reveals that open-source systems still struggle with basic physical laws, while some closed-source models approach plausibility. The framework also enables physics-aware direct preference optimization (DPO) to further improve the physical realism and commonsense reasoning of T2V outputs.

Abstract

Driven by growing model capacity and training scale, Text-to-Video (T2V) generation models have recently made substantial progress in video quality, length, and instruction-following capability. However, whether these models understand physics and can generate physically plausible videos remains an open question. Although Vision-Language Models (VLMs) are widely used as general-purpose evaluators in various applications, they struggle to identify physically impossible content in generated videos. To investigate this issue, we construct the \textbf{PID} (\textbf{P}hysical \textbf{I}mplausibility \textbf{D}etection) dataset, which consists of a \textit{test split} of 500 manually annotated videos and a \textit{train split} of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models to produce physically implausible content. Using this dataset, we introduce a lightweight fine-tuning approach that enables VLMs not only to detect physically implausible events but also to generate textual explanations of the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely \textbf{PhyDetEx}, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains challenging, especially for open-source models. Our dataset, training code, and checkpoints are available at \href{https://github.com/Zeqing-Wang/PhyDetEx}{https://github.com/Zeqing-Wang/PhyDetEx}.
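As a rough illustration of the paired train split described in the abstract, the following Python sketch shows one way such a pair could be represented and expanded into labeled examples, and how a detector's binary predictions might be scored against annotated labels. All names here (`PairedSample`, `to_training_examples`, `detection_accuracy`), the field layout, the placeholder target strings, and the choice of plain accuracy as the metric are assumptions made for illustration; they are not taken from the released PhyDetEx code or data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PairedSample:
    """Hypothetical record for one train-split pair (field names are assumptions)."""
    real_video: str                    # path to the real-world video
    real_caption: str                  # caption describing the real video
    rewritten_caption: str             # caption rewritten to induce a physics violation
    implausible_video: str             # T2V output generated from the rewritten caption
    explanation: Optional[str] = None  # text naming the violated principle, if available


def to_training_examples(pair: PairedSample) -> List[Tuple[str, int, str]]:
    """Expand a pair into (video, label, target_text) rows; label 1 means implausible."""
    implausible_text = pair.explanation or "The video is physically implausible."
    return [
        (pair.real_video, 0, "The video is physically plausible."),
        (pair.implausible_video, 1, implausible_text),
    ]


def detection_accuracy(predictions: List[int], labels: List[int]) -> float:
    """Fraction of videos whose predicted label matches the annotated label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty split")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)
```

Keeping the real and generated videos in one record preserves the pairing, so a training loop can always present both members of a pair and the detector sees matched plausible and implausible examples of the same scene.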

Paper Structure

Figures (8)