Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals

Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, Chen Sun

TL;DR

This work introduces Force Prompting, a framework for conditioning video generation on physical forces in two modalities: local point forces and global wind forces. The models are trained on modest synthetic datasets (~15k videos for global forces, ~23k for local forces) generated with Blender and related tools. The approach builds on the priors of a pretrained video model (CogVideoX) through a ControlNet integration, so it needs no 3D assets or physics simulator at inference. It generalizes across objects, materials, and settings, and shows initial hints of mass-aware dynamics. Ablations show the importance of visual diversity and of specific text keywords during training. The authors release datasets, code, weights, and interactive demos to foster further research in intuitive physics and interactive world models.

Abstract

Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well-explored, physically meaningful interactions that mimic real-world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts can enable videos to respond realistically to physical control signals by leveraging the visual and motion prior in the original pretrained model, without using any 3D asset or physics simulator at inference. The primary challenge of force prompting is the difficulty in obtaining high quality paired force-video training data, both in the real world due to the difficulty of obtaining force signals, and in synthetic data due to limitations in the visual quality and domain diversity of physics simulators. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with limited demonstrations of few objects. Our method can generate videos which simulate forces across diverse geometries, settings, and materials. We also try to understand the source of this generalization and perform ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k training examples for a single day on four A100 GPUs, and outperforms existing methods on force adherence and physics realism, bringing world models closer to real-world physics interactions. We release all datasets, code, weights, and interactive video demos at our project page.

Paper Structure

This paper contains 31 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Force prompting allows users to apply either global or local forces to objects in an image and then generate the resultant video. Despite being trained on a limited set of synthetic videos (15k for global force and 23k for local force), we observe significant generalization to different settings, materials, objects, geometries, affordances, and some initial hints at mass understanding. Trajectory visualization or alpha overlay are incorporated to better illustrate movement for some examples.
  • Figure 2: Visualizing the point force control signal. The magnitude of the applied force is proportional to the Gaussian blob's velocity in the control signal, so faster blobs produce proportionally stronger impulses. Stronger forces (bottom) generate faster-moving blobs and correspondingly larger physical responses than gentler forces (top). Note: a red line is added at the same location in each image for visualization. In our method, we enable the force prompt to dictate the object's trajectory, deliberately excluding such specifics from the text prompt.
  • Figure 3: Qualitative results for the Local Force (Poke) model. Top section: for local forces, the control signal can specify the location, magnitude, and direction of the force. Bottom section: despite the limited training data, the model generalizes to different types of motion. We add blue lines to visualize a time-lapse of some objects' movements.
  • Figure 4: Qualitative results for the Global Force (Wind) model. Top: from the same starting image, different directions for the force result in different videos. Bottom: while the model was only trained on flags, it can generalize to many different settings, producing different types of motion.
  • Figure 5: Results from our ablation studies on synthetic dataset design choices. Left: when the global wind force model is trained on a dataset with only one flag, it overfits, causing the woman's arm to wave unnaturally like fabric. Middle: when trained with a single background, the global force model has significantly degraded overall visual quality. Right: when trained without distractor objects, the point force model cannot properly localize motion, applying forces indiscriminately rather than to the intended target.
  • ...and 7 more figures
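
The Figure 2 caption describes the point force control signal as a Gaussian blob whose velocity is proportional to the force magnitude. The sketch below illustrates that idea only; it is not the authors' implementation. The function names `point_force_signal` and `blob_center`, the single-channel output, the normalized magnitude in [0, 1], and the `max_travel_px` and `sigma_px` parameters are all assumptions made for illustration.

```python
import numpy as np


def point_force_signal(height, width, num_frames, start_xy, angle_deg,
                       magnitude, max_travel_px=40.0, sigma_px=6.0):
    """Render a (num_frames, height, width) array with one moving Gaussian blob.

    Assumption: `magnitude` is normalized to [0, 1]. The blob's total travel,
    and hence its per-frame velocity, scales linearly with it.
    """
    if not 0.0 <= magnitude <= 1.0:
        raise ValueError("magnitude must lie in [0, 1]")

    theta = np.deg2rad(angle_deg)
    # Image coordinates: x grows rightward and y grows downward, so the sine
    # is negated to make 90 degrees point up on screen.
    direction = np.array([np.cos(theta), -np.sin(theta)])
    travel = magnitude * max_travel_px  # hypothetical force-to-pixels scale

    # Pixel coordinate grids, each of shape (height, width).
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)

    frames = np.empty((num_frames, height, width), dtype=np.float64)
    for t in range(num_frames):
        progress = t / max(num_frames - 1, 1)  # runs from 0.0 to 1.0
        cx = start_xy[0] + progress * travel * direction[0]
        cy = start_xy[1] + progress * travel * direction[1]
        squared_dist = (xs - cx) ** 2 + (ys - cy) ** 2
        frames[t] = np.exp(-squared_dist / (2.0 * sigma_px ** 2))
    return frames


def blob_center(frame):
    """Return the (x, y) pixel where the blob peaks in a single frame."""
    y, x = np.unravel_index(np.argmax(frame), frame.shape)
    return int(x), int(y)
```

Calling `point_force_signal(64, 64, 9, (10, 32), 0.0, 0.25)` and then again with magnitude `1.0` yields two clips that start at the same pixel, with the stronger force moving the blob four times as far over the same number of frames. This mirrors the gentle (top) versus strong (bottom) comparison in the caption.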