VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior
Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia
TL;DR
VLIPP addresses the lack of physical plausibility in generated video with a two-stage framework that injects a vision- and language-informed physical prior into video diffusion. A VLM acts as a coarse-level motion planner, using chain-of-thought reasoning to predict approximate object trajectories that obey physics. A VDM then acts as a fine-level synthesizer, turning these trajectories into detailed motion by starting from structured noise derived from optical flow. The approach uses GPT-4o for scene understanding, Grounded-SAM2 for object localization, and RAFT-based optical flow to guide diffusion-based video synthesis, with a controlled noise injection scheme that balances adherence to the plan against realism. Evaluations on PhyGenBench and Physics-IQ show state-of-the-art physical realism and motion plausibility, supported by ablation and user studies, which suggests the approach is useful for world-model-style video generation.
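To make the idea of a coarse motion plan concrete, the sketch below generates rough per-frame bounding boxes for an object dropped from rest and derives the frame-to-frame displacement that a flow field inside the box would carry. It is only an illustration under stated assumptions: in VLIPP the trajectories come from VLM reasoning, not from a closed-form formula, and the names `Box`, `free_fall_boxes`, `box_displacements`, and the parameters `g_px` and `floor_y` are hypothetical and do not appear in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """Axis-aligned box in pixels; image y grows downward (assumed convention)."""
    x: float
    y: float
    w: float
    h: float


def free_fall_boxes(start: Box, num_frames: int, fps: float,
                    g_px: float, floor_y: float) -> List[Box]:
    """Coarse per-frame boxes for an object released from rest.

    g_px is gravity expressed in pixels per second squared (hypothetical scale),
    floor_y is the y coordinate of the ground line in pixels.
    """
    boxes = []
    for i in range(num_frames):
        t = i / fps
        y = start.y + 0.5 * g_px * t * t      # constant-acceleration fall
        y = min(y, floor_y - start.h)         # stop when the box bottom hits the floor
        boxes.append(Box(start.x, y, start.w, start.h))
    return boxes


def box_displacements(boxes: List[Box]) -> List[Tuple[float, float]]:
    """Per-frame (dx, dy) of the box, usable as a uniform flow vector inside it."""
    return [(b.x - a.x, b.y - a.y) for a, b in zip(boxes, boxes[1:])]


if __name__ == "__main__":
    plan = free_fall_boxes(Box(10.0, 0.0, 20.0, 20.0), num_frames=5,
                           fps=10.0, g_px=200.0, floor_y=100.0)
    for step, (dx, dy) in enumerate(box_displacements(plan)):
        print(f"frame {step} -> {step + 1}: dx={dx:.2f}, dy={dy:.2f}")
```

A plan like this is deliberately rough: it fixes where the object should be in each frame and leaves appearance, deformation, and secondary motion to the video model.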
Abstract
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the community's attention to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics through a vision- and language-informed physical prior. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. As the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: https://madaoer.github.io/projects/physically_plausible_video_generation.
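The trade-off behind adding noise at inference can be illustrated with a small sketch. It blends a structured noise sample (standing in for noise shaped by the planned motion) with fresh Gaussian noise using a variance-preserving mix. This is an assumption made for illustration, not the paper's actual injection scheme; the function `blend_noise` and the mixing weight `gamma` are hypothetical names.

```python
import math
import random
from typing import List


def blend_noise(structured: List[float], gamma: float,
                rng: random.Random) -> List[float]:
    """Mix structured noise with fresh standard Gaussian noise.

    gamma = 0 keeps the structured noise unchanged (strict adherence to the plan);
    gamma = 1 replaces it entirely with fresh noise (full freedom for the model).
    If the inputs are independent with unit variance, the output also has
    unit variance, because (1 - gamma^2) + gamma^2 = 1.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    keep = math.sqrt(1.0 - gamma * gamma)
    return [keep * s + gamma * rng.gauss(0.0, 1.0) for s in structured]


if __name__ == "__main__":
    rng = random.Random(0)
    structured = [rng.gauss(0.0, 1.0) for _ in range(20000)]
    mixed = blend_noise(structured, gamma=0.5, rng=rng)
    mean = sum(mixed) / len(mixed)
    var = sum((v - mean) ** 2 for v in mixed) / len(mixed)
    print(f"variance after blending: {var:.3f}")
```

Keeping the variance at one matters in this kind of mix because diffusion samplers generally assume the starting noise is standard Gaussian; a smaller `gamma` keeps the output closer to the planned motion, while a larger one leaves more room for fine detail.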
