Physics-as-Inverse-Graphics: Unsupervised Physical Parameter Estimation from Video

Miguel Jaques, Michael Burke, Timothy Hospedales

TL;DR

The paper addresses unsupervised physical parameter estimation and state discovery from video when the governing dynamics are known but object-level labels are unavailable. It proposes physics-as-inverse-graphics, which combines vision-based inverse graphics with a differentiable physics engine, using a coordinate-consistent decoder to render predictions from latent object coordinates and velocities. The approach yields accurate long-term video predictions and enables data-efficient vision-based model-predictive control, demonstrated on multiple dynamical systems and an OpenAI Gym pendulum. Key contributions include end-to-end unsupervised learning of physical parameters, explicit interpretable states, and successful zero-shot adaptation through physics reasoning, all enabled by the tight coupling of vision and differentiable physics. This framework advances physics-grounded scene understanding and control from pixels with minimal supervision, with potential for broader applicability in vision-guided robotics and scientific inference.

Abstract

We propose a model that is able to perform unsupervised physical parameter estimation of systems from video, where the differential equations governing the scene dynamics are known, but labeled states or objects are not available. Existing physical scene understanding methods require either object state supervision, or do not integrate with differentiable physics to learn interpretable system parameters and states. We address this problem through a physics-as-inverse-graphics approach that brings together vision-as-inverse-graphics and differentiable physics engines, enabling objects and explicit state and velocity representations to be discovered. This framework allows us to perform long-term extrapolative video prediction, as well as vision-based model-predictive control. Our approach significantly outperforms related unsupervised methods in long-term future frame prediction of systems with interacting objects (such as ball-spring or 3-body gravitational systems), due to its ability to build dynamics into the model as an inductive bias. We further show the value of this tight vision-physics integration by demonstrating data-efficient learning of vision-actuated model-based control for a pendulum system. We also show that the controller's interpretability provides unique capabilities in goal-driven control and physical reasoning for zero-data adaptation.
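The core idea of the abstract, recovering a physical parameter by differentiating through a known dynamics model, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a 1D unit-mass spring ($\ddot{x} = -kx$), uses observed positions directly in place of encoder outputs from video, and replaces automatic differentiation with a hand-derived sensitivity $\partial x / \partial k$. The names `simulate`, `loss_and_grad` and `fit_spring_constant`, the semi-implicit Euler integrator, and all numeric settings are hypothetical choices made for illustration.

```python
def simulate(k, x0, v0, dt, steps):
    """Roll out x'' = -k*x with semi-implicit Euler.

    Also propagates the sensitivities dx/dk and dv/dk, which makes the
    rollout differentiable with respect to the spring constant k.
    """
    x, v = x0, v0
    dx, dv = 0.0, 0.0  # derivatives of x and v with respect to k
    xs, dxs = [], []
    for _ in range(steps):
        # Velocity update and its derivative (both use the old x and dx).
        dv = dv - (x + k * dx) * dt
        v = v - k * x * dt
        # Position update and its derivative (both use the new v and dv).
        dx = dx + dv * dt
        x = x + v * dt
        xs.append(x)
        dxs.append(dx)
    return xs, dxs


def loss_and_grad(k, observed, x0, v0, dt):
    """Mean squared position error and its gradient with respect to k."""
    xs, dxs = simulate(k, x0, v0, dt, len(observed))
    n = len(observed)
    loss = sum((x - y) ** 2 for x, y in zip(xs, observed)) / n
    grad = sum(2.0 * (x - y) * d for x, y, d in zip(xs, observed, dxs)) / n
    return loss, grad


def fit_spring_constant(observed, x0, v0, dt, k_init=1.0, lr=1.0, iters=500):
    """Plain gradient descent on k; no labels beyond the trajectory itself."""
    k = k_init
    for _ in range(iters):
        _, grad = loss_and_grad(k, observed, x0, v0, dt)
        k -= lr * grad
    return k


# Synthetic "observations" generated with a ground-truth k (assumed value).
TRUE_K = 4.0
observed, _ = simulate(TRUE_K, x0=1.0, v0=0.0, dt=0.05, steps=40)
k_est = fit_spring_constant(observed, x0=1.0, v0=0.0, dt=0.05)
```

The rollout horizon is kept short (under one oscillation period) because trajectory-matching losses for oscillators tend to become non-convex in the frequency over long horizons; the horizon and learning rate here are illustrative settings, not values from the paper.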

Paper Structure

This paper contains 13 sections, 9 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Left: High-level view of our architecture. The encoder (top-right) estimates the position of $N$ objects in each input frame. These are passed to the velocity estimator which estimates objects' velocities at the last input frame. The positions and velocities of the last input frame are passed as initial conditions to the physics engine. At every time-step, the physics engine outputs a set of positions, which are used by the decoder (bottom-right) to output a predicted image. If the system is actuated, an input action is passed to the physics engine at every time-step. See Section 3 for detailed descriptions of the encoder and decoder architectures.
  • Figure 2: Future frame predictions for 3-ball gravitational system (top) and 2-digit spring system (bottom). IN: Interaction Network. Only the combination of Physics and Inverse-Graphics maintains object integrity and correct dynamics many steps into the future.
  • Figure 3: Frame prediction accuracy (SSI, higher is better) for the balls datasets. The region left of the green dashed line corresponds to the training range, $T_{pred}$; the region to the right corresponds to extrapolation, $T_{ext}$. We outperform Interaction Networks (IN) (Watters et al., 2017), DDPAE (Hsieh et al., 2018) and VideoLSTM (Srivastava et al., 2015) in extrapolation due to incorporating explicit physics.
  • Figure 4: Contents and masks learned by the decoder. Object masks: $\sigma(\mathbf{m})$. Objects for rendering: $\sigma(\mathbf{m})\odot \mathbf{c}$. Contents and masks correctly capture each part of the scene: colored balls, MNIST digits and CIFAR background. We omit the black background learned on the balls dataset.
  • Figure 5: Top: Comparison between our model and PlaNet (Hafner et al., 2018) in terms of learning sample efficiency (left). Explicit physics allows reasoning for zero-shot adaptation to domain-shift in gravity (center) and goal-driven control to balance the pendulum in any position (right). DDPG (VAE) corresponds to a DDPG agent trained on the latent space of an autoencoder (trained with 320k images) after 80k steps. DDPG (proprio) corresponds to an agent trained from proprioception after 30k steps. Bottom: The first 3 rows show a zero-shot counterfactual episode with a gravity multiplier of 1.4 for an oracle, our model and PlaNet, with vertical as the target position (as trained). The last row shows an episode using a goal image to infer the non-vertical goal state.
  • ...and 6 more figures
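
The mask and content quantities named in the Figure 4 caption can be made concrete with a small sketch. It covers only what the caption states, the object mask $\sigma(\mathbf{m})$ and the rendered object $\sigma(\mathbf{m})\odot \mathbf{c}$, for a single-channel 2x2 patch. How the decoder positions objects or combines them with the background is not described here and is not modeled; the function names and the toy values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a mask logit to an opacity in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def object_mask(mask_logits):
    """sigma(m): per-pixel opacity computed from the mask logits m."""
    return [[sigmoid(z) for z in row] for row in mask_logits]


def render_object(mask_logits, content):
    """sigma(m) * c, elementwise: the content c gated by the object mask."""
    mask = object_mask(mask_logits)
    return [
        [a * c for a, c in zip(mask_row, content_row)]
        for mask_row, content_row in zip(mask, content)
    ]


# Toy 2x2 example (assumed values): very negative logits hide a pixel,
# very positive logits show the content almost unchanged.
m = [[-20.0, 0.0], [20.0, 20.0]]
c = [[0.8, 0.8], [0.2, 0.6]]
rendered = render_object(m, c)
```

Keeping the mask separate from the content is what lets the same learned content be shown or hidden per pixel, which matches the caption's distinction between "object masks" and "objects for rendering".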