MotionCraft: Physics-based Zero-Shot Video Generation

Luca Savant Aira, Antonio Montanaro, Emanuele Aiello, Diego Valsesia, Enrico Magli

TL;DR

MotionCraft tackles zero-shot video generation by warping the noise latents of a pretrained image diffusion model with optical flow derived from physics simulations, producing realistic, physically grounded motion without any additional training. It introduces MCFA (Multiple Cross-Frame Attention) and Spatial-eta sampling to preserve existing content while letting the model generate new content where the prescribed motion requires it. Quantitative and qualitative results across rigid-body, fluid, and multi-agent dynamics show improvements over the zero-shot Text2Video-Zero baseline, with more coherent and plausible animations. The authors also discuss practical implications for physics-based visualization, list limitations such as color shifts and imperfect inversion fidelity, and point to future directions such as more complex mixed-physics scenes and feedback loops.

Abstract

Generating videos with realistic and physically plausible motion is one of the main recent challenges in computer vision. While diffusion models are achieving compelling results in image generation, video diffusion models are limited by heavy training and huge models, resulting in videos that are still biased to the training dataset. In this work we propose MotionCraft, a new zero-shot video generator to craft physics-based and realistic videos. MotionCraft is able to warp the noise latent space of an image diffusion model, such as Stable Diffusion, by applying an optical flow derived from a physics simulation. We show that warping the noise latent space results in coherent application of the desired motion while allowing the model to generate missing elements consistent with the scene evolution; applying the flow in the pixel space would instead result in artefacts or missing content. We compare our method with the state-of-the-art Text2Video-Zero, reporting qualitative and quantitative improvements and demonstrating the effectiveness of our approach in generating videos with finely prescribed complex motion dynamics. Project page: https://mezzelfo.github.io/MotionCraft/
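
The core idea in the abstract, moving noise latents along an optical flow, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `warp_latent`, the nearest-neighbour backward warp, the (C, H, W) layout, and the choice to fill newly revealed regions with fresh Gaussian noise are all assumptions made for illustration, and the flow here is a hand-written constant shift rather than the output of a physics simulation.

```python
import numpy as np

def warp_latent(latent, flow, rng):
    """Backward-warp a latent (C, H, W) with a flow (2, H, W), nearest-neighbour.

    flow[0] is the horizontal (x) and flow[1] the vertical (y) displacement,
    in latent pixels. The output at (y, x) is read from the source location
    (y - flow_y, x - flow_x). Locations whose source lies outside the frame
    are filled with fresh Gaussian noise (an assumption of this sketch).
    """
    c, h, w = latent.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.rint(xs - flow[0]).astype(int)
    src_y = np.rint(ys - flow[1]).astype(int)

    # Mask of output pixels whose source pixel exists in the input latent.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)

    # Clip so the gather below never indexes out of bounds.
    src_x = np.clip(src_x, 0, w - 1)
    src_y = np.clip(src_y, 0, h - 1)
    warped = latent[:, src_y, src_x]  # shape (C, H, W)

    # Revealed regions get new noise, leaving them free to be generated.
    n_missing = int((~valid).sum())
    warped[:, ~valid] = rng.standard_normal((c, n_missing))
    return warped, valid

# Hypothetical example: shift a random 4-channel 8x8 latent 2 pixels right.
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))
flow = np.zeros((2, 8, 8))
flow[0] = 2.0
warped, valid = warp_latent(latent, flow, rng)
```

Nearest-neighbour sampling is used here because it copies latent values unchanged, whereas interpolation would average neighbouring noise samples and alter their statistics. In a full pipeline the flow would come from a simulator and be resampled to the latent resolution before warping.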

Paper Structure (25 sections, 4 equations, 19 figures, 1 table, 1 algorithm)

Figures (19)

  • Figure 1: Melting man simulation. Top: MotionCraft; Bottom: T2V0 (Khachatryan et al., 2023). MotionCraft uses a fluid dynamics simulation to warp noise latents and synthesise video frames. T2V0 is unable to simulate the evolution of the melting statue and simply moves the object towards the bottom of the frame.
  • Figure 2: A qualitative example of the correlation between image and latent flows. From left to right: (a) the first RGB frame, (b) the second RGB frame superimposed with the estimated flow in the RGB domain, (c) the first latent frame, (d) the second latent frame superimposed with the estimated flow in the latent domain, and (e) the correlation map of the two non-zero flows.
  • Figure 3: MotionCraft overview. A video is generated from a starting image using a pretrained still image generative model by warping noise latents according to an optical flow description of the motion to be synthesised.
  • Figure 4: Rigid motion simulation: satellite orbit. Top: MotionCraft; Bottom: T2V0 (Khachatryan et al., 2023).
  • Figure 5: Rigid motion simulation: revolving Earth. Top: MotionCraft; Bottom: T2V0 (Khachatryan et al., 2023).
  • ...and 14 more figures