PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop
Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, Saining Xie
TL;DR
This work introduces PISA, a physics-informed evaluation framework, and PisaBench, a benchmark that assesses physical accuracy in video diffusion models using a controlled object-freefall task. The authors propose a two-stage post-training pipeline: Physics Supervised Fine-Tuning (PSFT) on simulated data, followed by Object Reward Optimization (ORO) with modular rewards. The pipeline yields substantial improvements over baselines and notable sim-to-real gains. Reward models targeting segmentation, optical flow, and depth yield complementary benefits, yet generalization to unseen depths and heights and full distributional alignment remain challenging because of gaps between simulation and reality. By releasing PisaBench, the paper provides a practical diagnostic tool for tracking progress toward physically grounded, generalizable world models in large-scale video generation systems.
Abstract
Large-scale pre-trained video generation models excel in content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of the simple, yet fundamental, physics task of modeling object freefall. We show that state-of-the-art video generation models struggle with this basic task, despite their visually impressive outputs. To remedy this problem, we find that fine-tuning on a relatively small number of simulated videos is effective in inducing the dropping behavior in the model, and that results can be further improved through a novel reward modeling procedure we introduce. Our study also reveals key limitations of post-training in generalization and distribution modeling. Additionally, we release a benchmark for this task that may serve as a useful diagnostic tool for tracking physical accuracy in large-scale video generative model development.
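To make the freefall task concrete, the following minimal sketch computes the height of an object released at rest under constant gravity, sampled once per video frame, and compares a candidate trajectory against it. It is an illustration only: the function names `freefall_heights` and `mean_abs_error`, the frame rate, and the error measure are assumptions chosen for this example and are not the metrics or code used by PisaBench.

```python
G = 9.81  # standard gravitational acceleration in m/s^2


def freefall_heights(h0, fps, num_frames, g=G):
    """Height above the ground (m) at each frame for an object dropped from rest.

    Uses h(t) = h0 - 0.5 * g * t^2 and clamps at 0 once the object reaches
    the ground (no bounce is modeled; this is a simplifying assumption).
    """
    return [max(h0 - 0.5 * g * (i / fps) ** 2, 0.0) for i in range(num_frames)]


def mean_abs_error(pred, ref):
    """Mean absolute per-frame height difference (hypothetical error measure)."""
    if len(pred) != len(ref) or len(ref) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


if __name__ == "__main__":
    # Reference trajectory: 1 m drop sampled at 10 frames per second.
    reference = freefall_heights(h0=1.0, fps=10, num_frames=10)
    # A physically implausible "floating" object that never falls.
    floating = [1.0] * 10
    print("reference heights:", [round(h, 3) for h in reference])
    print("error of floating object:", round(mean_abs_error(floating, reference), 3))
```

A generated video whose object hovers or falls at the wrong rate would produce a large error under a measure like this, which is the kind of failure the abstract attributes to current video generation models.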
