PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop

Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, Saining Xie

TL;DR

This work introduces PISA, a physics-informed evaluation framework, and PisaBench, a benchmark for assessing physical accuracy in video diffusion models on a controlled object-freefall task. The authors propose a two-stage post-training pipeline: Physics Supervised Fine-Tuning (PSFT) on simulated data, followed by Object Reward Optimization (ORO) with modular rewards. The pipeline yields substantial improvements over baselines and notable sim-to-real gains. Reward models targeting segmentation, optical flow, and depth provide complementary benefits, yet generalization to unseen depths and heights and full distributional alignment remain challenging because of the gap between simulation and reality. By releasing PisaBench, the paper provides a practical diagnostic tool for tracking progress toward physically grounded, generalizable world models in large-scale video generation systems.

Abstract

Large-scale pre-trained video generation models excel in content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of the simple, yet fundamental, physics task of modeling object freefall. We show state-of-the-art video generation models struggle with this basic task, despite their visually impressive outputs. To remedy this problem, we find that fine-tuning on a relatively small amount of simulated videos is effective in inducing the dropping behavior in the model, and we can further improve results through a novel reward modeling procedure we introduce. Our study also reveals key limitations of post-training in generalization and distribution modeling. Additionally, we release a benchmark for this task that may serve as a useful diagnostic tool for tracking physical accuracy in large-scale video generative model development.
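
To make the freefall task concrete, the following is a minimal sketch of the idealized physics a generated video would need to match: an object released from rest under constant gravity, with air resistance ignored. The function names, the frame-rate parameter, and the simple per-frame error are illustrative assumptions, not the paper's evaluation code or metrics.

```python
G = 9.81  # m/s^2, standard gravity (assumed; air resistance ignored)


def freefall_heights(h0, fps, n_frames):
    """Height above the ground (m) at each frame for an object released from rest at h0.

    Uses h(t) = h0 - 0.5 * G * t^2 and clamps at 0 once the object reaches the ground.
    """
    heights = []
    for i in range(n_frames):
        t = i / fps  # time of frame i in seconds
        h = h0 - 0.5 * G * t * t
        heights.append(max(h, 0.0))
    return heights


def mean_abs_error(pred, ref):
    """Mean absolute per-frame height error (hypothetical comparison, not a paper metric)."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


if __name__ == "__main__":
    reference = freefall_heights(h0=1.0, fps=10, n_frames=5)
    # A hypothetical "generated" trajectory in which the object falls at constant speed.
    generated = [1.0 - 0.2 * i for i in range(5)]
    print(reference)
    print(mean_abs_error(generated, reference))
```

A trajectory that falls at constant speed, as in the hypothetical `generated` list, deviates from the quadratic reference even though it may look plausible frame by frame, which is the kind of error a freefall benchmark is designed to expose.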

Paper Structure

This paper contains 29 sections, 19 equations, 22 figures, 4 tables.

Figures (22)

  • Figure 1: Our PISA (Physics-Informed Simulation and Alignment) evaluation framework includes a new video dataset, where objects are dropped in a variety of real-world (left) and synthetic (right) scenes. For visualization purposes, we depict object motion by overlaying multiple video frames in each image shown above. Our real-world videos enable us to evaluate the physical accuracy of generated video output, and our synthetic videos enable us to improve accuracy through the use of post-training alignment methods.
  • Figure 2: The setup for collecting real-world videos.
  • Figure 3: Statistics of the real-world data: (a) number of objects in each video, (b) the proportions of different scenes in the videos.
  • Figure 4: Examples of various objects included in our dataset. For simulation, we utilize the GSO dataset (Downs et al., 2022), while for the real-world dataset, we curate our own set of common household objects.
  • Figure 5: Example of annotations in real-world data. For segmentation masks, we manually annotate the first frame and use SAM 2 to propagate the masks across the remaining frames. For captions, we annotate “{object description} falls.” for all video segments.
  • ...and 17 more figures