ProPhy: Progressive Physical Alignment for Dynamic World Simulation

Zijun Wang, Panwen Hu, Jing Wang, Terry Jingchen Zhang, Yuhao Cheng, Long Chen, Yiqiang Yan, Zutao Jiang, Hanhui Li, Xiaodan Liang

TL;DR

ProPhy addresses the gap in physics-consistent video generation by introducing a progressive, physics-aware conditioning framework. It deploys a two-stage Mixture-of-Physics-Experts with a Semantic Expert Block for global priors and a Refinement Expert Block for token-level dynamics, coupled with a VLM-guided fine-grained alignment strategy. The approach yields anisotropic responses to localized physical cues, meaning that different regions of a video respond differently to the physics described in the prompt, and demonstrates state-of-the-art performance on physics-aware benchmarks, improving both physical plausibility and semantic adherence across backbones. The work advances world-simulation capabilities in diffusion-based video generation and suggests paths for integrating explicit physical laws in future research. Practical impact includes improved realism in physics-rich scenes and potential educational uses, with acknowledged limitations in annotation reliability and the lack of explicit governing equations.

Abstract

Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
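The two-stage routing described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name `TwoStageExperts`, the linear experts, the softmax gating, the residual combination, and all dimensions are assumptions made purely for illustration. The first stage weights a set of semantic experts once per video using the text embedding; the second stage weights a set of refinement experts separately for every video token.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TwoStageExperts:
    """Hypothetical two-stage mixture of experts (illustrative only).

    Stage 1 routes once per video from a text embedding (global prior).
    Stage 2 routes once per token (token-level refinement).
    All experts are plain linear maps, which is an assumption.
    """

    def __init__(self, dim, n_semantic, n_refine, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.sem_router = rng.normal(0, scale, (dim, n_semantic))
        self.sem_experts = rng.normal(0, scale, (n_semantic, dim, dim))
        self.ref_router = rng.normal(0, scale, (dim, n_refine))
        self.ref_experts = rng.normal(0, scale, (n_refine, dim, dim))

    def forward(self, text_emb, tokens):
        # text_emb: (dim,), tokens: (n_tokens, dim)
        # Stage 1: one set of expert weights for the whole video.
        sem_w = softmax(text_emb @ self.sem_router)                    # (n_semantic,)
        sem_out = np.einsum("d,kde->ke", text_emb, self.sem_experts)   # (n_semantic, dim)
        global_prior = sem_w @ sem_out                                 # (dim,)
        h = tokens + global_prior                                      # same prior added to every token

        # Stage 2: a separate set of expert weights for each token.
        ref_w = softmax(h @ self.ref_router, axis=-1)                  # (n_tokens, n_refine)
        ref_out = np.einsum("td,kde->tke", h, self.ref_experts)        # (n_tokens, n_refine, dim)
        refined = h + np.einsum("tk,tke->te", ref_w, ref_out)          # residual update
        return refined, sem_w, ref_w


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    model = TwoStageExperts(dim=8, n_semantic=3, n_refine=4)
    refined, sem_w, ref_w = model.forward(rng.normal(size=8), rng.normal(size=(5, 8)))
    print(refined.shape, sem_w.shape, ref_w.shape)
```

In this sketch, `sem_w` is a single weight vector shared by the whole video, while `ref_w` has one row per token. The per-token rows are what allow different regions to draw on different experts, which is the sense in which the response is anisotropic rather than uniform across the video.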

Paper Structure

This paper contains 34 sections, 6 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Top-left: Prior work typically relies on implicit alignment without explicit physical priors or uses video-level module routing as the source of physical awareness in video generation models. Top-right: Overview of our proposed ProPhy, a progressive alignment framework, which injects and aligns learnable physical priors and performs fine-grained token-level routing, enabling different experts to internalize different domains of physical knowledge. Bottom: Qualitative comparison between our method and prior work in complex scenarios. Red boxes and arrows indicate violations of physical laws.
  • Figure 2: Overview of our proposed ProPhy framework. ProPhy uses a progressive physical alignment design, consisting of the Semantic Expert Block and the Refinement Expert Block. During inference, the model runs end-to-end and aligns physics categories through our proposed blocks.
  • Figure 3: Pipeline for annotating token-level physical attributes using a VLM.
  • Figure 4: Study of the attention localization capabilities of VDM and VLM. The VDM cross-attention maps are obtained by adding 10% noise and then denoising. As shown, despite minor imperfections, the VLM-based approach more accurately identifies the locations of the corresponding physical phenomena.
  • Figure 5: Qualitative comparison among ProPhy, CogVideoX, Wan2.1, and existing physics-aware methods.
  • ...and 9 more figures