PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding

Haoze Zhang, Tianyu Huang, Zichen Wan, Xiaowei Jin, Hongzhi Zhang, Hui Li, Wangmeng Zuo

TL;DR

PhysChoreo tackles the gap between high-fidelity video generation and explicit physical controllability from a single image by introducing a two-stage approach: part-aware physics reconstruction and physics-editable simulation. It aligns per-part semantics with geometry through soft assignment and hierarchical cross-attention, and couples this with physics-enabled, temporally controllable dynamics that condition a video model for realistic outputs. A novel text–part–physics dataset provides ground truth for per-part physical properties, enabling robust training and evaluation. Experiments show state-of-the-art performance in both predicting continuous physical properties and generating instruction-following, physically plausible videos, highlighting the framework's potential for counterfactual and controllable physics in vision tasks.

Abstract

While recent video generation models have achieved significant visual fidelity, they often lack explicit physical controllability and plausibility. To address this, some recent studies have attempted to guide video generation with physics-based rendering. However, these methods face inherent challenges in accurately modeling complex physical properties and effectively controlling the resulting physical behavior over extended temporal sequences. In this work, we introduce PhysChoreo, a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism. Experimental results show that PhysChoreo can generate videos with rich behaviors and physical realism, outperforming state-of-the-art methods on multiple evaluation metrics.

Paper Structure

This paper contains 15 sections, 12 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: We propose PhysChoreo, a new framework for controllable image-to-video generation. PhysChoreo can reconstruct the material field of objects from a single image and generate physically realistic and dynamically rich videos. In (a) and (b), based on the reconstructed physical properties, physically realistic dynamics can be generated. In (c) and (d), by controlling the physical properties during the generation process, more cinematic videos can be generated while maintaining physical realism.
  • Figure 2: Overview of our pipeline. Given the input image and text prompt, we first reconstruct the initial material field of each object from the image. We then generate the scene's trajectory video with a physics-editable simulator driven by temporal instructions. Finally, the trajectory video serves as conditional control to guide the generative video model.
  • Figure 3: Overview of our model design. We first build a fused feature from the point positional features and the segmentation prior. Afterward, we use a soft assignment to preliminarily display the injected part-level features, then perform fine-grained text-level adjustments through a hierarchical cross-attention stage, and finally obtain part-aware material field features via a transformer encoder.
  • Figure 4: Our model achieves controllable part-level prediction of physical properties through text conditioning.
  • Figure 5: Qualitative comparison between PhysChoreo and existing image-to-video generation models.
  • ...and 2 more figures
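
The "soft assignment" named in the Figure 3 caption can be illustrated with a generic sketch. This is not the authors' implementation: the function names (`soft_assign`, `softmax`, `dot`), the dot-product similarity, the `temperature` parameter, and the toy feature vectors are all assumptions made for illustration. The idea shown is only that each point receives a probability distribution over parts instead of a single hard label, and that its part-level feature is the resulting weighted blend.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def soft_assign(point_feats, part_feats, temperature=1.0):
    """Softly assign each point to every part (hypothetical helper).

    Returns (weights, blended):
      weights[i][k] is the probability that point i belongs to part k;
      blended[i] is the weighted average of the part features for point i.
    """
    dim = len(part_feats[0])
    weights, blended = [], []
    for p in point_feats:
        # Similarity of this point to each part, turned into probabilities.
        w = softmax([dot(p, q) for q in part_feats], temperature)
        weights.append(w)
        # Blend part features by assignment probability.
        blended.append([
            sum(w[k] * part_feats[k][d] for k in range(len(part_feats)))
            for d in range(dim)
        ])
    return weights, blended

# Toy data (assumed): three 2-D point features and two part features.
point_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
part_feats = [[4.0, 0.0], [0.0, 4.0]]
weights, blended = soft_assign(point_feats, part_feats)
```

In this toy setup, the first point leans strongly toward the first part, while the third point sits exactly between the two parts and receives equal weights. A lower `temperature` would push the assignment closer to a hard, one-part-per-point labeling.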