Think Before You Diffuse: Infusing Physical Rules into Video Diffusion

Ke Zhang; Cihan Xiao; Jiacong Xu; Yiqun Mei; Vishal M. Patel

TL;DR

This work addresses video generation that is visually realistic yet physically inaccurate by integrating physical reasoning into diffusion-based video synthesis. DiffPhy leverages LLMs to infer physical context from prompts, uses a multimodal verifier to convert that reasoning into differentiable supervision, and fine-tunes a diffusion backbone with physics-aware losses and failure-focused attention. A real-world dataset, PhyHQ, supports robust training, and extensive benchmarks show state-of-the-art physical plausibility across mechanics, optics, thermal, and material domains. The approach promises more reliable, physics-consistent video generation for applications in film, robotics, and embodied AI.

Abstract

Recent video diffusion models have demonstrated great capability in generating visually pleasing results, yet synthesizing correct physical effects in generated videos remains challenging. The complexity of real-world motions, interactions, and dynamics makes learning physics from data difficult. In this work, we propose DiffPhy, a generic framework that enables physically correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model. Our method leverages large language models (LLMs) to infer rich physical context from the text prompt. To incorporate this context into the video diffusion model, we use a multimodal large language model (MLLM) to verify intermediate latent variables against the inferred physical rules, guiding the gradient updates of the model accordingly. The textual output of the LLM is transformed into continuous signals. We then formulate a set of training objectives that jointly ensure physical accuracy and semantic alignment with the input text. Additionally, failure facts of physical phenomena are corrected via attention injection. We also establish a high-quality physical video dataset containing diverse physical actions and events to facilitate effective fine-tuning. Extensive experiments on public benchmarks demonstrate that DiffPhy produces state-of-the-art results across diverse physics-related scenarios. Our project page is available at https://bwgzk-keke.github.io/DiffPhy/.
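
The abstract does not say how textual model output becomes a continuous signal. One common way to do this, shown here purely as an illustrative sketch and not as DiffPhy's actual formulation, is to read the verifier's logits for a "yes" and a "no" answer to each physical-rule question, turn them into a probability, and penalize low "yes" probability. The function names, the yes/no framing, and the example logits below are all assumptions introduced for illustration.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over hypothetical 'yes'/'no' answer logits."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def physics_loss(logit_pairs: list[tuple[float, float]]) -> float:
    """Mean negative log-probability of 'yes' over all checked rules.

    Each pair holds assumed (logit_yes, logit_no) values that a verifier
    might assign to one question such as "does the ball fall downward?".
    """
    if not logit_pairs:
        raise ValueError("at least one rule is required")
    eps = 1e-12  # clamp to avoid log(0) when 'yes' is extremely unlikely
    total = 0.0
    for logit_yes, logit_no in logit_pairs:
        total += -math.log(max(yes_probability(logit_yes, logit_no), eps))
    return total / len(logit_pairs)


if __name__ == "__main__":
    # Hypothetical logits for three inferred rules; the third is violated.
    rules = [(3.0, -1.0), (2.0, 0.0), (-2.0, 1.5)]
    print(f"physics loss: {physics_loss(rules):.4f}")
```

In a real training loop the same computation would be written with a differentiable tensor library so that gradients can flow back toward the diffusion model; the plain-Python version above only shows the mapping from discrete answers to a scalar score.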
