PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

Qiyao Xue; Xiangyu Yin; Boyuan Yang; Wei Gao

TL;DR

PhyT2V addresses the lack of physical consistency in text-to-video (T2V) generation with a data-independent, LLM-guided iterative prompt refinement framework. It combines chain-of-thought reasoning and global step-back prompting with a video-captioning feedback loop, so that prompts are refined without retraining the diffusion-based model. One refinement step is formalized as $p' = f_{enhance}(p, f_{mismatch}(C(V(p)), p), f_{phy}(p), \theta)$. Across multiple T2V backbones and prompts, PhyT2V delivers up to 2.3× improvements in physical realism (PC) and semantic adherence (SA), outperforms standard prompt enhancers by about 35%, and generalizes to out-of-distribution domains. The approach is automation-friendly and model-agnostic, though the authors note limitations in some cases, such as temporal flickering and human-hand generation. PhyT2V thus offers a practical way to enforce real-world physics in video generation without retraining, which broadens the applicability of T2V systems to simulation and planning tasks.
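The refinement formula above can be read as a loop: generate a video, caption it, compare the caption with the prompt, and rewrite the prompt. The following is a minimal sketch of that reading, not the authors' implementation. All function and parameter names (`refine_prompt`, `generate_video`, `caption_video`, `find_mismatch`, `physical_rules`, `enhance`, `rounds`) are hypothetical, and $\theta$ is assumed to be folded into the supplied callables because this summary does not define it.

```python
from typing import Any, Callable, List


def refine_prompt(
    prompt: str,
    generate_video: Callable[[str], Any],     # V: prompt -> video (assumed T2V model wrapper)
    caption_video: Callable[[Any], str],      # C: video -> text caption (assumed captioner)
    find_mismatch: Callable[[str, str], str], # f_mismatch(caption, prompt) (assumed LLM call)
    physical_rules: Callable[[str], str],     # f_phy(prompt) (assumed LLM call)
    enhance: Callable[[str, str, str], str],  # f_enhance(prompt, mismatch, rules) (assumed LLM call)
    rounds: int = 3,                          # number of refinement rounds (assumed)
) -> List[str]:
    """Return the prompt history [p_0, p_1, ..., p_rounds]."""
    history = [prompt]
    for _ in range(rounds):
        p = history[-1]
        video = generate_video(p)             # V(p)
        caption = caption_video(video)        # C(V(p))
        mismatch = find_mismatch(caption, p)  # f_mismatch(C(V(p)), p)
        rules = physical_rules(p)             # f_phy(p)
        history.append(enhance(p, mismatch, rules))  # p'
    return history


if __name__ == "__main__":
    # Toy stand-ins so the loop runs without any model; they only
    # illustrate the data flow, not real generation or reasoning.
    history = refine_prompt(
        "a ball falls",
        generate_video=lambda p: {"prompt": p},
        caption_video=lambda v: v["prompt"],
        find_mismatch=lambda c, p: "" if c == p else "caption differs",
        physical_rules=lambda p: "gravity accelerates the ball",
        enhance=lambda p, m, r: f"{p}; {r}",
        rounds=2,
    )
    for step, p in enumerate(history):
        print(step, p)
```

Passing the five components as callables keeps the loop independent of any particular T2V backbone, captioner, or LLM, which matches the model-agnostic framing of the summary.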

Abstract

Text-to-video (T2V) generation has recently been enabled by transformer-based diffusion models, but current T2V models struggle to adhere to real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficiency in temporal modeling. Existing solutions are either data-driven or require extra model inputs, and do not generalize to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that extends the video generation capability of current T2V models to out-of-distribution domains by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves a 35% improvement compared to T2V prompt enhancers. The source code is available at: https://github.com/pittisl/PhyT2V.
