Understanding Physical Dynamics with Counterfactual World Modeling

Rahul Venkatesh, Honglin Chen, Kevin Feigelis, Daniel M. Bear, Khaled Jedoui, Klemen Kotar, Felix Binder, Wanhee Lee, Sherry Liu, Kevin A. Smith, Judith E. Fan, Daniel L. K. Yamins

TL;DR

The paper introduces Counterfactual World Modeling (CWM), a self-supervised framework that pretrains a single video predictor with a temporally-factored masking policy to concentrate transformation information into a small set of patch embeddings. By applying simple prompts and counterfactual interventions to this predictor, CWM can extract useful vision structures—keypoints, optical flow, and segmentations—without annotated data and use them to understand physical dynamics. On Physion v1.5, CWM achieves state-of-the-art results on object contact prediction and detection, while qualitative and ablation analyses show the extracted structures are meaningful and the prompting approach is effective. The work demonstrates that counterfactual prompting can uncover core causal structure in visual dynamics, enabling strong zero-shot analyses and broad transfer to related benchmarks such as activity recognition and IntPhys.

Abstract

The ability to understand physical dynamics is critical for agents to act in the world. Here, we use Counterfactual World Modeling (CWM) to extract vision structures for dynamics understanding. CWM uses a temporally-factored masking policy for masked prediction of video data without annotations. This policy enables highly effective "counterfactual prompting" of the predictor, allowing a spectrum of visual structures to be extracted from a single pre-trained predictor without finetuning on annotated datasets. We demonstrate that these structures are useful for physical dynamics understanding, allowing CWM to achieve state-of-the-art performance on the Physion benchmark.
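
The temporally-factored masking policy can be illustrated with a small sketch. This is not the authors' implementation: the function name `temporally_factored_mask`, the 14x14 patch grid, and the 10% visible fraction for the second frame are assumptions chosen for illustration. The sketch only shows the asymmetry described above, where the first frame is fully visible and the second frame reveals a sparse random subset of patches.

```python
import numpy as np


def temporally_factored_mask(grid_h, grid_w, visible_frac_second=0.1, rng=None):
    """Return a boolean mask of shape (2, grid_h, grid_w); True means "masked".

    Frame 0 is left fully visible. Frame 1 is masked except for a sparse,
    randomly chosen subset of patches. The visible fraction is an
    illustrative assumption, not a value taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = grid_h * grid_w
    num_visible = max(1, int(round(visible_frac_second * num_patches)))

    mask = np.zeros((2, num_patches), dtype=bool)  # frame 0: nothing masked
    mask[1, :] = True                              # frame 1: masked by default
    visible_idx = rng.choice(num_patches, size=num_visible, replace=False)
    mask[1, visible_idx] = False                   # reveal a sparse subset
    return mask.reshape(2, grid_h, grid_w)


# Hypothetical usage: a 14x14 patch grid with 10% of second-frame patches visible.
mask = temporally_factored_mask(14, 14, visible_frac_second=0.1,
                                rng=np.random.default_rng(0))
```

Because the predictor must reconstruct most of the second frame from only a few of its patches, those few patches have to carry the information about how the scene changed, which is what later makes them usable as prompts.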

Paper Structure (49 sections, 8 equations, 22 figures, 10 tables)


Figures (22)

  • Figure 1: Overview of the approach. Given an input video of a physical scenario, we extract feature representations and vision structures such as keypoints, optical flow, and segments. These structures are extracted from a single pre-trained CWM predictor without finetuning on annotated datasets. We use the extracted features and structures for dynamics understanding: detecting a past collision or predicting a future collision.
  • Figure 2: Climbing the Ladder of Causation with the CWM framework: (a) Temporally-factored masked predictor for association learning. Given a frame pair as input, the predictor takes in dense visible patches from the first frame and only a sparse subset of patches from the second frame, and learns to predict the masked patches. This policy encourages the model to concentrate scene dynamics into the embeddings of a few patches. (b) Prompting as interventions. As a result of the temporally-factored masking, we can intervene by modifying one or a few visual patches in the prompt and steer the outcome of the predictor. (c) Structure extraction using counterfactuals. Multiple vision structures can be extracted by comparing the results of interventions to alternative futures (e.g. observed ground truth or observed predictions).
  • Figure 3: Counterfactual predictions and structure extraction. (a) Counterfactual predictions. A small number of visual patches exert meaningful control over scene dynamics. Each panel shows a prompt consisting of the input image (left), a few patches copied from the input image (middle), and the resulting predictions (right). A red patch is copied into the same location as its source, simulating the appearance of holding an object fixed. A green patch is copied into a different location at an offset from the source location, simulating apparent object motion. (b) Structure extraction for keypoints, flows, and segments.
  • Figure 4: Physion v1.5 evaluation protocol. We evaluate on two physical dynamics understanding tasks: (a) object contact prediction, where the model is asked to predict contact events in the future, and (b) object contact detection, where the model is asked to reason about contact events that occur in the observed video stimulus. The objects of interest for the contact question are rendered with red and yellow textures to cue the model.
  • Figure 5: Qualitative comparison of counterfactual motion prediction and structure extraction on real-world datasets. When we apply the extraction procedures described in Section \ref{sec:programs_struct} to VideoMAE, the model fails to generate counterfactual motion and extracts less meaningful structures than CWM. Segments cannot be extracted from VideoMAE because its counterfactual predictions fail, and are hence not shown in this comparison. This shows the importance of the temporally-factored masking policy during pre-training.
  • ...and 17 more figures
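
The prompt construction described in the Figure 3 caption (copying a patch to the same location to hold an object fixed, or to an offset location to suggest motion) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `build_counterfactual_prompt`, its arguments, and the zero-filled placeholder for masked regions are all hypothetical, and the predictor that would consume the prompt is omitted.

```python
import numpy as np


def build_counterfactual_prompt(frame0, patch, fixed=(), moved=()):
    """Build a sparse second-frame prompt by copying patches from frame0.

    frame0: (H, W, C) array holding the first frame.
    patch:  patch side length in pixels (H and W assumed divisible by it).
    fixed:  iterable of (row, col) patch coordinates copied to the same
            location ("hold this region fixed").
    moved:  iterable of ((row, col), (d_row, d_col)) pairs; the source patch
            is copied to the location offset by (d_row, d_col) patches.
    Returns (prompt, visible): prompt is zero everywhere except the copied
    patches; visible is a (H // patch, W // patch) boolean grid marking them.
    """
    grid_h, grid_w = frame0.shape[0] // patch, frame0.shape[1] // patch
    prompt = np.zeros_like(frame0)
    visible = np.zeros((grid_h, grid_w), dtype=bool)

    def copy_patch(src, dst):
        for r, c in (src, dst):
            if not (0 <= r < grid_h and 0 <= c < grid_w):
                raise ValueError(f"patch coordinate {(r, c)} is outside the grid")
        (sr, sc), (dr, dc) = src, dst
        prompt[dr * patch:(dr + 1) * patch, dc * patch:(dc + 1) * patch] = \
            frame0[sr * patch:(sr + 1) * patch, sc * patch:(sc + 1) * patch]
        visible[dr, dc] = True

    for rc in fixed:
        copy_patch(rc, rc)                      # same location: hold fixed
    for (r, c), (d_r, d_c) in moved:
        copy_patch((r, c), (r + d_r, c + d_c))  # offset location: apparent motion
    return prompt, visible


# Hypothetical usage on a toy 8x8 RGB frame split into a 2x2 grid of 4-pixel patches.
frame0 = np.arange(8 * 8 * 3).reshape(8, 8, 3)
prompt, visible = build_counterfactual_prompt(
    frame0, patch=4, fixed=[(0, 0)], moved=[((0, 1), (1, 0))]
)
```

In a full pipeline, the first frame and this sparse prompt would be passed to the pre-trained predictor, and structures would be read out by comparing its counterfactual prediction against an alternative future, as the Figure 2 caption describes.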