Understanding Physical Dynamics with Counterfactual World Modeling
Rahul Venkatesh, Honglin Chen, Kevin Feigelis, Daniel M. Bear, Khaled Jedoui, Klemen Kotar, Felix Binder, Wanhee Lee, Sherry Liu, Kevin A. Smith, Judith E. Fan, Daniel L. K. Yamins
TL;DR
The paper introduces Counterfactual World Modeling (CWM), a self-supervised framework that pretrains a single video predictor with a temporally-factored masking policy to concentrate transformation information into a small set of patch embeddings. By applying simple prompts and counterfactual interventions to this predictor, CWM can extract useful vision structures—keypoints, optical flow, and segmentations—without annotated data and use them to understand physical dynamics. On Physion v1.5, CWM achieves state-of-the-art results on object contact prediction and detection, while qualitative and ablation analyses show the extracted structures are meaningful and the prompting approach is effective. The work demonstrates that counterfactual prompting can uncover core causal structure in visual dynamics, enabling strong zero-shot analyses and broad transfer to related benchmarks such as activity recognition and IntPhys.
Abstract
The ability to understand physical dynamics is critical for agents to act in the world. Here, we use Counterfactual World Modeling (CWM) to extract vision structures for dynamics understanding. CWM uses a temporally-factored masking policy for masked prediction of video data without annotations. This policy enables highly effective "counterfactual prompting" of the predictor, allowing a spectrum of visual structures to be extracted from a single pre-trained predictor without finetuning on annotated datasets. We demonstrate that these structures are useful for physical dynamics understanding, allowing CWM to achieve state-of-the-art performance on the Physion benchmark.
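As a rough illustration of what a temporally-factored masking policy might look like, the sketch below assumes a two-frame input split into a grid of patches, with the first frame left fully visible and only a handful of patches revealed in the second frame. The function name, the 14x14 patch grid, and the choice of four visible patches are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def temporally_factored_mask(grid_h, grid_w, num_visible_next, rng=None):
    """Build boolean patch masks (True = masked) for a two-frame clip.

    Assumed policy, for illustration only: the first frame is fully
    visible, and the second frame is masked everywhere except for
    `num_visible_next` randomly chosen patches.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = grid_h * grid_w
    if not 0 <= num_visible_next <= num_patches:
        raise ValueError("num_visible_next must lie in [0, grid_h * grid_w]")

    # First frame: no patch is masked.
    mask_first = np.zeros(num_patches, dtype=bool)

    # Second frame: start fully masked, then reveal a few random patches.
    mask_next = np.ones(num_patches, dtype=bool)
    visible_idx = rng.choice(num_patches, size=num_visible_next, replace=False)
    mask_next[visible_idx] = False

    return mask_first.reshape(grid_h, grid_w), mask_next.reshape(grid_h, grid_w)


if __name__ == "__main__":
    first, nxt = temporally_factored_mask(14, 14, 4, rng=np.random.default_rng(0))
    print("masked patches in first frame:", int(first.sum()))
    print("masked patches in second frame:", int(nxt.sum()))
```

Under this assumed policy, a predictor trained to fill in the second frame has to read how the scene changed from the few revealed patches, which is one way to picture transformation information being concentrated into a small set of patch embeddings.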
