Interpretable Representation Learning from Videos using Nonlinear Priors

Marian Longa, João F. Henriques

TL;DR

We propose a deep learning framework in which one specifies nonlinear priors for videos, allowing the model to learn interpretable latent variables and to use them to generate videos of hypothetical scenarios not observed at training time.

Abstract

Learning interpretable representations of visual data is an important challenge, to make machines' decisions understandable to humans and to improve generalisation outside of the training distribution. To this end, we propose a deep learning framework where one can specify nonlinear priors for videos (e.g. of Newtonian physics) that allow the model to learn interpretable latent variables and use these to generate videos of hypothetical scenarios not observed at training time. We do this by extending the Variational Auto-Encoder (VAE) prior from a simple isotropic Gaussian to an arbitrary nonlinear temporal Additive Noise Model (ANM), which can describe a large number of processes (e.g. Newtonian physics). We propose a novel linearization method that constructs a Gaussian Mixture Model (GMM) approximating the prior, and derive a numerically stable Monte Carlo estimate of the KL divergence between the posterior and prior GMMs. We validate the method on different real-world physics videos including a pendulum, a mass on a spring, a falling object and a pulsar (rotating neutron star). We specify a physical prior for each experiment and show that the correct variables are learned. Once a model is trained, we intervene on it to change different physical variables (such as oscillation amplitude or adding air drag) to generate physically correct videos of hypothetical scenarios that were not observed previously.
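
The abstract mentions a numerically stable Monte Carlo estimate of the KL divergence between two GMMs. As an illustration only, and not the authors' implementation, the following minimal sketch estimates KL(q || p) by sampling from q and averaging log q(x) - log p(x), using the log-sum-exp trick so that components far from a sample do not underflow. The function names (`gmm_log_pdf`, `sample_gmm`, `mc_kl`), the diagonal covariances, and the `(weights, means, variances)` tuple layout are assumptions made for this example.

```python
import numpy as np

def gmm_log_pdf(x, weights, means, variances):
    """Log density of a diagonal-covariance GMM at points x of shape (n, d).

    weights: (k,), means: (k, d), variances: (k, d). Hypothetical layout.
    """
    diff = x[:, None, :] - means[None, :, :]                      # (n, k, d)
    # Per-component Gaussian log densities, summed over dimensions.
    log_comp = -0.5 * np.sum(
        diff**2 / variances + np.log(2.0 * np.pi * variances), axis=-1
    )                                                             # (n, k)
    a = np.log(weights)[None, :] + log_comp
    # Log-sum-exp: subtract the row maximum before exponentiating.
    m = a.max(axis=1, keepdims=True)
    return (m + np.log(np.sum(np.exp(a - m), axis=1, keepdims=True)))[:, 0]

def sample_gmm(rng, n, weights, means, variances):
    """Draw n samples: pick a component, then add scaled Gaussian noise."""
    idx = rng.choice(len(weights), size=n, p=weights)
    noise = rng.standard_normal((n, means.shape[1]))
    return means[idx] + np.sqrt(variances[idx]) * noise

def mc_kl(rng, n, q, p):
    """Monte Carlo estimate of KL(q || p) from n samples drawn from q."""
    x = sample_gmm(rng, n, *q)
    return float(np.mean(gmm_log_pdf(x, *q) - gmm_log_pdf(x, *p)))
```

For two single-component mixtures N(0, 1) and N(1, 1), the exact KL divergence is 0.5, which gives a quick sanity check on the estimate. In the paper's setting q and p would be the posterior and prior GMMs; how those are constructed is described in the paper itself and is not reproduced here.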

Paper Structure

This paper contains 27 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Left: architecture to compute the reconstruction loss. Right: architecture to compute the KL divergence loss. See the method section for details.
  • Figure 2: Latent space visualisation showing the prior generated using the linearization method (blue), posterior generated by encoding many images from the dataset (red), and the intervened prior (green), shown for experiments (a-d).
  • Figure 3: Original and counterfactual videos for experiments (a-d) obtained by decoding the original and intervened priors.
  • Figure 4: Crab pulsar intensity data (black) and a double Gaussian mixture fit (red) at the original UVB frequency (left) and at the intervention frequency 1418 MHz (right).