CoPhy: Counterfactual Learning of Physical Dynamics
Fabien Baradel, Natalia Neverova, Julien Mille, Greg Mori, Christian Wolf
TL;DR
This work introduces CoPhy, a benchmark for counterfactual learning of physical dynamics from visual input, and a neural model (CoPhyNet) that infers latent confounders without supervision and uses them to predict counterfactual outcomes after interventions. The model combines object-centric de-rendering, graph neural networks for inter-object interactions, per-object GRUs that derive the latent confounders, and a stability-gated trajectory predictor that generates the counterfactual trajectories. In the experiments, CoPhyNet outperforms feedforward predictors on three scenarios and generalizes to unseen confounder configurations and object counts, while human participants prove more error-prone on counterfactual tasks. The work advances causal reasoning from high-dimensional perceptual input and has implications for model-based reinforcement learning and robust physical reasoning in AI.
Abstract
Understanding causes and effects in mechanical systems is an essential component of reasoning in the physical world. This work poses a new problem of counterfactual learning of object mechanics from visual input. We develop the CoPhy benchmark to assess the capacity of state-of-the-art models for causal physical reasoning in a synthetic 3D environment and propose a model for learning the physical dynamics in a counterfactual setting. Having observed a mechanical experiment that involves, for example, a falling tower of blocks, a set of bouncing balls, or colliding objects, we learn to predict how its outcome is affected by an arbitrary intervention on its initial conditions, such as displacing one of the objects in the scene. The alternative future is predicted given the altered past and a latent representation of the confounders learned by the model in an end-to-end fashion with no supervision. We compare against feedforward video prediction baselines and show how observing alternative experiences allows the network to capture latent physical properties of the environment, which results in significantly more accurate predictions, at the level of superhuman performance.
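The counterfactual setup described in the abstract can be illustrated with a deliberately tiny, analytic toy. This is a sketch under stated assumptions, not CoPhyNet: it replaces visual input with scalar measurements and replaces the learned latent representation with a closed-form estimate. The sliding-block scenario, the friction coefficient standing in for the confounder, and all function names (`stopping_distance`, `infer_friction`, `counterfactual_distance`) are hypothetical choices made for illustration.

```python
G = 9.81  # gravitational acceleration in m/s^2


def stopping_distance(v0, mu):
    """Distance a block slides on a flat surface before kinetic friction stops it."""
    return v0 ** 2 / (2.0 * mu * G)


def infer_friction(v0, distance):
    """Recover the hidden friction coefficient from one observed experiment.

    Assumption: this closed-form inversion plays the role that a learned
    latent confounder representation would play in a neural model.
    """
    return v0 ** 2 / (2.0 * G * distance)


def counterfactual_distance(v0_observed, d_observed, v0_intervened):
    """Predict the outcome under an intervention on the initial velocity."""
    mu_hat = infer_friction(v0_observed, d_observed)
    return stopping_distance(v0_intervened, mu_hat)


# Observed experiment: the friction coefficient is hidden from the predictor.
true_mu = 0.3
d_observed = stopping_distance(2.0, true_mu)

# Intervention: double the initial velocity and ask for the alternative outcome.
d_counterfactual = counterfactual_distance(2.0, d_observed, 4.0)

# Baseline that never sees the original experiment and must assume a prior
# friction value (0.5 here, an arbitrary illustrative choice).
d_feedforward = stopping_distance(4.0, 0.5)

d_truth = stopping_distance(4.0, true_mu)
print(f"truth={d_truth:.3f}  counterfactual={d_counterfactual:.3f}  "
      f"feedforward={d_feedforward:.3f}")
```

The point of the toy is only the information flow: a predictor that sees the original experiment can pin down the hidden property and reuse it after the intervention, whereas a predictor given only the altered initial conditions has to fall back on a prior and is wrong whenever that prior does not match the scene.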
