PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

Qiyao Xue, Xiangyu Yin, Boyuan Yang, Wei Gao

TL;DR

PhyT2V addresses the lack of physical consistency in T2V generation with a data-independent, LLM-guided framework for iterative prompt refinement. It combines chain-of-thought reasoning and global step-back prompting with a video-captioning feedback loop to refine prompts without retraining the diffusion-based model, formalized as $p' = f_{enhance}(p, f_{mismatch}(C(V(p)), p), f_{phy}(p), \theta)$. Across multiple T2V backbones and prompts, PhyT2V delivers up to 2.3× improvements in physical realism (PC) and semantic adherence (SA) and outperforms standard prompt enhancers by about 35%, demonstrating strong generalization to out-of-distribution domains. The approach is automated and model-agnostic, though the authors note limitations such as temporal flickering and human-hand generation in some cases. Overall, PhyT2V offers a practical, scalable way to enforce real-world physics in video generation without retraining, broadening the applicability of T2V systems to simulation and planning tasks.
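The update rule above can be read as a loop: generate a video from the prompt, caption it, compare the caption with the prompt, and rewrite the prompt using the mismatch and the relevant physical rules. The following is a minimal sketch of that loop, not the authors' implementation. It assumes that $V$ is the T2V model, $C$ is the video captioner, and that $f_{phy}$, $f_{mismatch}$, and $f_{enhance}$ are LLM calls; here they are replaced by hypothetical toy functions so the example runs on its own. The parameter `theta` is passed through opaquely, and stopping when no mismatch is reported is an assumption standing in for "until the quality of the generated video is satisfactory".

```python
from typing import Any, Callable, List


def refine_prompt(
    prompt: str,
    V: Callable[[str], Any],                 # prompt -> video (assumed: the T2V model)
    C: Callable[[Any], str],                 # video -> caption (assumed: the captioner)
    f_phy: Callable[[str], str],             # prompt -> relevant physical rules
    f_mismatch: Callable[[str, str], str],   # (caption, prompt) -> mismatch description
    f_enhance: Callable[[str, str, str, Any], str],  # -> refined prompt p'
    theta: Any = None,                       # opaque extra argument from the formula
    max_rounds: int = 3,
) -> List[str]:
    """Return every prompt visited, starting with the original one."""
    history = [prompt]
    for _ in range(max_rounds):
        p = history[-1]
        caption = C(V(p))
        mismatch = f_mismatch(caption, p)
        if not mismatch:
            # Assumed stopping rule: no reported mismatch means "satisfactory".
            break
        # p' = f_enhance(p, f_mismatch(C(V(p)), p), f_phy(p), theta)
        history.append(f_enhance(p, mismatch, f_phy(p), theta))
    return history


# Hypothetical toy stand-ins so the sketch is runnable without any model.
def toy_V(p: str) -> dict:
    return {"prompt": p}  # the "video" just remembers its prompt


def toy_C(video: dict) -> str:
    return video["prompt"].lower()  # the "caption" echoes the prompt


def toy_f_phy(p: str) -> str:
    return "the ball falls under gravity"


def toy_f_mismatch(caption: str, p: str) -> str:
    return "" if "gravity" in caption else "no falling motion described"


def toy_f_enhance(p: str, mismatch: str, rules: str, theta: Any) -> str:
    return f"{p}; {rules}"


history = refine_prompt(
    "A ball is dropped", toy_V, toy_C, toy_f_phy, toy_f_mismatch, toy_f_enhance
)
print(history)
```

In a real pipeline the toy functions would wrap a T2V backbone, a video captioning model, and an LLM; the loop structure itself does not depend on which models are used, which is what makes the approach model-agnostic.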

Abstract

Text-to-video (T2V) generation has recently been enabled by transformer-based diffusion models, but current T2V models struggle to adhere to real-world common knowledge and physical rules because of their limited understanding of physical realism and deficient temporal modeling. Existing solutions are either data-driven or require extra model inputs, and do not generalize to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that extends current T2V models' video generation capability to out-of-distribution domains by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x and achieves a 35% improvement over T2V prompt enhancers. The source code is available at: https://github.com/pittisl/PhyT2V.

Paper Structure

This paper contains 21 sections, 32 figures, 6 tables.

Figures (32)

  • Figure 1: One iteration of video and prompt self-refinement in PhyT2V. This self-refinement is repeated over multiple rounds until the quality of the generated video is satisfactory.
  • Figure 2: Examples of videos generated from in-distribution and out-of-distribution prompts, using the CogVideoX-5B model.
  • Figure 3: A video generated by enhancing the out-of-distribution prompt "Whisking egg into milk for scramble" in Figure 2.
  • Figure 4: Examples of CoT and step-back reasoning
  • Figure 5: Our design of PhyT2V, illustrated by one round of video refinement consisting of three steps. Text in brown is input from the previous step; text in red is output to the next step; text in purple is the prompt that triggers LLM reasoning.
  • ...and 27 more figures