Think Before You Diffuse: Infusing Physical Rules into Video Diffusion

Ke Zhang, Cihan Xiao, Jiacong Xu, Yiqun Mei, Vishal M. Patel

TL;DR

This work addresses video generation that looks realistic but is physically inaccurate, by integrating physical reasoning into diffusion-based video synthesis. DiffPhy uses LLMs to infer physical context from prompts, uses a multimodal verifier to convert that reasoning into differentiable supervision, and fine-tunes a diffusion backbone with physics-aware losses and failure-focused attention. A real-world dataset, PhyHQ, supports training, and benchmark evaluations show state-of-the-art physical plausibility across the mechanics, optics, thermal, and material domains. The approach promises more reliable, physics-consistent video generation for applications in film, robotics, and embodied AI.

Abstract

Recent video diffusion models have demonstrated great capability in generating visually pleasing results, yet synthesizing correct physical effects in generated videos remains challenging. The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. In this work, we propose DiffPhy, a generic framework that enables physically correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model. Our method leverages large language models (LLMs) to infer rich physical context from the text prompt. To incorporate this context into the video diffusion model, we use a multimodal large language model (MLLM) to verify intermediate latent variables against the inferred physical rules, guiding the gradient updates of the model accordingly. The textual output of the LLM is transformed into continuous signals. We then formulate a set of training objectives that jointly ensure physical accuracy and semantic alignment with the input text. Additionally, failure facts of physical phenomena are corrected via attention injection. We also establish a high-quality physical video dataset containing diverse physical actions and events to facilitate effective fine-tuning. Extensive experiments on public benchmarks demonstrate that DiffPhy produces state-of-the-art results across diverse physics-related scenarios. Our project page is available at https://bwgzk-keke.github.io/DiffPhy/.
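
The abstract states that textual verifier output is transformed into continuous signals but does not give the formula. The sketch below shows one common way such a conversion can be done, and it is an assumption rather than the authors' implementation: for each physical rule, the verifier is asked a yes/no question, and the logits of its "yes" and "no" answer tokens are turned into a probability and a loss. The function names and the example logits are hypothetical.

```python
import math

def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the 'yes'/'no' answer logits (assumed verifier output)."""
    d = logit_yes - logit_no
    # Numerically stable sigmoid of the logit difference.
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)

def rule_loss(logit_yes: float, logit_no: float) -> float:
    """Negative log-probability of 'yes', computed as a stable softplus."""
    x = logit_no - logit_yes
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def physics_loss(rule_logits: list[tuple[float, float]]) -> float:
    """Average the per-rule losses into one continuous training signal."""
    return sum(rule_loss(y, n) for y, n in rule_logits) / len(rule_logits)

# Hypothetical (yes, no) logits for three inferred physical rules.
example_logits = [(2.0, -1.0), (0.0, 0.0), (-1.5, 1.0)]
example_loss = physics_loss(example_logits)
```

In a real training loop the same computation would be written in an autodiff framework so that the loss can guide gradient updates; plain Python is used here only to keep the arithmetic visible.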

Paper Structure

This paper contains 36 sections, 2 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: DiffPhy enables physically grounded and semantically aligned video generation across diverse real-world scenarios, including gravity-driven motion, fluid interactions, forceful impacts, and object manipulation. It outperforms the state-of-the-art video diffusion model Wan 2.1–14B in both visual plausibility and physical coherence.
  • Figure 1: T2V comparisons on PhyGenBench.
  • Figure 2: We present DiffPhy with (a) a training paradigm and (b) an architectural overview. Figure (a) illustrates how we incorporate verified physical rules to guide gradient updates and attention injection on the latent variables during video diffusion steps. Figure (b) illustrates the network architecture and visualizes the attention injection mechanism.
  • Figure 2: PhyGenBench evaluation of phenomena detection, physical order, GPT-4o and open-source models.
  • Figure 3: Qualitative comparison with T2V models on VideoPhy2. We show two challenging cases, sports and box collapse, where our results are more natural and more consistent with the description.
  • ...and 6 more figures