PhysAlign: Physics-Coherent Image-to-Video Generation through Feature and 3D Representation Alignment

Zhexiao Xiong, Yizhi Song, Liu He, Wei Xiong, Yu Yuan, Feng Qiao, Nathan Jacobs

Abstract

Video Diffusion Models (VDMs) offer a promising approach for simulating dynamic scenes and environments, with broad applications in robotics and media generation. However, existing models often generate temporally incoherent content that violates basic physical intuition, significantly limiting their practical applicability. We propose PhysAlign, an efficient framework for physics-coherent image-to-video (I2V) generation that explicitly addresses this limitation. To overcome the critical scarcity of physics-annotated videos, we first construct a fully controllable synthetic data generation pipeline based on rigid-body simulation, yielding a highly curated dataset with accurate, fine-grained physics and 3D annotations. Leveraging this data, PhysAlign constructs a unified physical latent space by coupling explicit 3D geometry constraints with a Gram-based spatio-temporal relational alignment that extracts kinematic priors from video foundation models. Extensive experiments demonstrate that PhysAlign significantly outperforms existing VDMs on tasks requiring complex physical reasoning and temporal stability, without compromising zero-shot visual quality. PhysAlign shows the potential to bridge the gap between raw visual synthesis and rigid-body kinematics, establishing a practical paradigm for genuinely physics-grounded video generation. The project page is available at https://physalign.github.io/PhysAlign.
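
To make the idea of a "Gram-based spatio-temporal relational alignment" concrete, the following is a minimal sketch of one common way such a loss can be written. It is not taken from the paper. The function names (`flatten_tokens`, `gram_matrix`, `gram_alignment_loss`), the cosine normalization, and the mean-squared-error reduction are all assumptions made for illustration, and the paper's actual formulation may differ. The sketch compares pairwise token similarities rather than raw features, so the generator's features and the foundation model's features may have different channel widths.

```python
import numpy as np


def flatten_tokens(video_features: np.ndarray) -> np.ndarray:
    """Flatten a (T, H, W, D) feature volume into (T*H*W, D) spatio-temporal tokens."""
    return video_features.reshape(-1, video_features.shape[-1])


def gram_matrix(tokens: np.ndarray) -> np.ndarray:
    """Cosine-similarity Gram matrix of shape (N, N) for tokens of shape (N, D)."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-8, None)  # guard against zero-norm tokens
    return unit @ unit.T


def gram_alignment_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean squared difference between the two token-relation (Gram) matrices.

    Assumption: both inputs hold the same number of tokens N, in the same order.
    The channel dimensions may differ because only N x N relations are compared.
    """
    if student.shape[0] != teacher.shape[0]:
        raise ValueError("student and teacher must have the same number of tokens")
    diff = gram_matrix(student) - gram_matrix(teacher)
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical shapes: 2 frames, 3x3 spatial grid, different channel widths.
    teacher = flatten_tokens(rng.normal(size=(2, 3, 3, 16)))
    student = flatten_tokens(rng.normal(size=(2, 3, 3, 8)))
    print(gram_alignment_loss(student, teacher))
```

Because the loss depends only on token-to-token similarities, no projection head is needed to match feature dimensions in this sketch. Whether PhysAlign uses such a head, or a different normalization, is not stated in this section.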

Paper Structure (26 sections, 14 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Comparisons of image-to-video (I2V) generation results on physics-involved scenarios. Red arrows indicate the initial motion directions of the objects, while green rectangles highlight key subjects that reveal underlying physical behaviors and dynamics. We introduce PhysAlign, an I2V framework that effectively infuses physical knowledge and 3D geometric priors into existing video generation models. PhysAlign significantly enhances physical coherence and 3D perceptual fidelity, generating videos that most faithfully conform to real-world physical laws.
  • Figure 2: PhysAlign framework. Our data generation pipeline leverages a physics simulator (i.e., Blender) to generate synthetic videos with 3D physical ground truth. Our method aligns the DiT [peebles2023scalable] latent features with both (i) physical knowledge features from V-JEPA2 [assran2025v], and (ii) 3D geometric features encoded from synthetic ground truth (e.g., depth). This unified alignment internalizes both physical laws and visual fidelity for the I2V generation task.
  • Figure 3: Comparison of our results with baseline models on the WISA-test set. Our method shows a better understanding of physical laws, which demonstrates its strong generalization ability to act as a real-world simulator. Zoom in for details.
  • Figure 4: Visualization of the synthetic data generated through our data generation pipeline.
  • Figure 5: The visualization of our user study page.
  • ...and 1 more figures