CoPhy: Counterfactual Learning of Physical Dynamics

Fabien Baradel, Natalia Neverova, Julien Mille, Greg Mori, Christian Wolf

TL;DR

This work introduces CoPhy, a benchmark for counterfactual learning of physical dynamics from visual input, and a neural model (CoPhyNet) that infers latent confounders without supervision to predict counterfactual outcomes after interventions. The model combines object-centric de-rendering, graph neural networks for inter-object interactions, per-object GRUs to derive latent confounders, and a stability-gated trajectory predictor to generate counterfactual trajectories. Experiments show that CoPhyNet outperforms feedforward video prediction baselines on three scenarios and generalizes to unseen confounder configurations and object counts; its predictions are also more accurate than those of human participants. The work advances causal reasoning in high-dimensional perception and has implications for model-based reinforcement learning and robust physical reasoning in AI.

Abstract

Understanding causes and effects in mechanical systems is an essential component of reasoning in the physical world. This work poses a new problem of counterfactual learning of object mechanics from visual input. We develop the CoPhy benchmark to assess the capacity of state-of-the-art models for causal physical reasoning in a synthetic 3D environment and propose a model for learning the physical dynamics in a counterfactual setting. Having observed a mechanical experiment that involves, for example, a falling tower of blocks, a set of bouncing balls or colliding objects, we learn to predict how its outcome is affected by an arbitrary intervention on its initial conditions, such as displacing one of the objects in the scene. The alternative future is predicted given the altered past and a latent representation of the confounders learned by the model in an end-to-end fashion with no supervision. We compare against feedforward video prediction baselines and show how observing alternative experiences allows the network to capture latent physical properties of the environment, which results in significantly more accurate predictions at the level of superhuman performance.

Paper Structure

This paper contains 13 sections, 7 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: We train a model for performing counterfactual learning of physical dynamics. Given an observed frame $\mathbf{A}=X_{0}$ and a sequence of future frames $\mathbf{B}=X_{1:\tau}$, we ask how the outcome $\mathbf{B}$ would have changed if we changed $X_{0}$ to $\bar{X}_0$ by performing a $do$-intervention (e.g. changing the initial positions of objects in the scene).
  • Figure 2: The difference between conventional video prediction (a) and counterfactual video prediction (b). The causal graph of the latter includes a confounder variable, which passes information from the original outcome to the outcome after the $do$-intervention. The initially observed sequence $(\mathbf{A},\mathbf{B})$ is shown on the left and the counterfactual sequence after the $do$-intervention on the right.
  • Figure 3: Stability distribution for each confounder variable for heights $K{=}3$ and $K{=}4$ of the BlockTowerCF task. Masses, friction coefficients: 2 configurations per block, $2^K$ total; gravity: 3 configurations for each axis ${\in}\{x,y\}$, 9 total.
  • Figure 4: Our model learns counterfactual reasoning in a weakly supervised way: while we supervise the do-operator, we do not supervise the confounder variables (masses, frictions, gravity). Input images of the original past ($\mathbf{A}$) and the original outcome ($\mathbf{B}$) are de-rendered into latent representations which are converted into fully-connected attributed graphs. A Graph Network updates node features to augment them with contextual information, which is integrated temporally with a set of RNNs, one for each object, running over time. The last hidden RNN state is taken as an estimate of the confounder $U$. A second set of GCN+RNN predicts residual object positions ($\mathbf{D}$) using the modified past ($\mathbf{C}$) and the confounder representation $U$. For clarity we draw arrows for the red object only. Not shown: stability prediction and gating.
  • Figure 5: Visual examples of human performance on the ill-posed task of feedforward, i.e. non-counterfactual, dynamic prediction from a single image (in the BlockTower scenario). The image shows the initial state $\mathbf{C}$. Small dots correspond to human estimates of the objects' final positions. Larger circles indicate ground truth final positions of each block. We note that this task is ill-posed by construction, as the dynamics of each experiment are defined by physical properties of each block (e.g. masses) which cannot be observed from a single image.
  • ...and 2 more figures
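
The data flow described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions: de-rendering is skipped (each object is taken to be already summarized by a 3D position), a plain tanh recurrent cell with weights shared across objects stands in for the per-object GRUs, all weights are random and untrained, and stability prediction and gating are left out, as in the figure. The function names (`init_params`, `graph_step`, `estimate_confounders`, `predict_counterfactual`) and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(d_in, d_hid):
    """Random (untrained) weights for one message-passing layer and one tanh RNN cell."""
    s = 0.1
    return {
        "W_msg": rng.normal(0, s, (2 * d_in, d_hid)),       # edge function on [sender, receiver]
        "W_node": rng.normal(0, s, (d_in + d_hid, d_hid)),  # node update on [node, summed messages]
        "W_x": rng.normal(0, s, (d_hid, d_hid)),            # RNN input weights
        "W_h": rng.normal(0, s, (d_hid, d_hid)),            # RNN recurrent weights
    }

def graph_step(x, p):
    """One round of message passing on a fully connected graph; x has shape (K, d_in)."""
    K = x.shape[0]
    senders = np.repeat(x[:, None, :], K, axis=1)    # senders[i, j] = x[i]
    receivers = np.repeat(x[None, :, :], K, axis=0)  # receivers[i, j] = x[j]
    msgs = np.tanh(np.concatenate([senders, receivers], axis=-1) @ p["W_msg"])
    mask = 1.0 - np.eye(K)[:, :, None]               # drop self-messages
    agg = (msgs * mask).sum(axis=0)                  # sum over senders for each receiver
    return np.tanh(np.concatenate([x, agg], axis=-1) @ p["W_node"])

def estimate_confounders(ab_states, p):
    """ab_states: (T, K, d_state) object states for A and B. Returns U with shape (K, d_hid)."""
    T, K, _ = ab_states.shape
    h = np.zeros((K, p["W_h"].shape[0]))
    for t in range(T):
        e = graph_step(ab_states[t], p)              # contextualized node features
        h = np.tanh(e @ p["W_x"] + h @ p["W_h"])     # one recurrent state per object
    return h                                         # last hidden state taken as U

def predict_counterfactual(c_state, U, p, W_out, steps):
    """Roll out from the modified past C, conditioned on U, via residual position updates."""
    state = c_state.copy()
    h = np.zeros_like(U)
    traj = []
    for _ in range(steps):
        e = graph_step(np.concatenate([state, U], axis=-1), p)
        h = np.tanh(e @ p["W_x"] + h @ p["W_h"])
        state = state + h @ W_out                    # predicted residual added to position
        traj.append(state)
    return np.stack(traj)                            # (steps, K, d_state)

# Hypothetical sizes: K objects, T observed frames, 3D positions, hidden width 8.
K, T, d_state, d_hid = 4, 6, 3, 8
p_conf = init_params(d_state, d_hid)
p_pred = init_params(d_state + d_hid, d_hid)
W_out = rng.normal(0, 0.1, (d_hid, d_state))

ab = rng.normal(size=(T, K, d_state))                # stand-in for de-rendered (A, B)
c = ab[0].copy()
c[0] += np.array([0.5, 0.0, 0.0])                    # do-intervention: displace object 0

U = estimate_confounders(ab, p_conf)
D = predict_counterfactual(c, U, p_pred, W_out, steps=5)
print(U.shape, D.shape)
```

The point of the sketch is the wiring: the confounder estimate `U` is computed only from the observed sequence, and the counterfactual rollout sees only the modified initial state and `U`. In a trained model, supervision would be applied to the predicted trajectory alone, so `U` would have to carry whatever information about masses, friction and gravity helps prediction.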