Guiding Video Prediction with Explicit Procedural Knowledge

Patrick Takenaka, Johannes Maucher, Marco F. Huber

TL;DR

We develop an architecture that facilitates latent space disentanglement so that the integrated procedural knowledge can be used, and establish a setup that lets the model learn the procedural interface in the latent space through the downstream task of video prediction.

Abstract

We propose a general way to integrate procedural knowledge of a domain into deep learning models. We apply it to the case of video prediction, building on top of object-centric deep models, and show that this leads to better performance than using data-driven models alone. We develop an architecture that facilitates latent space disentanglement in order to use the integrated procedural knowledge, and establish a setup that allows the model to learn the procedural interface in the latent space using the downstream task of video prediction. We contrast the performance with that of a state-of-the-art data-driven approach and show that problems where purely data-driven approaches struggle can be handled by using knowledge about the domain, providing an alternative to simply collecting more data.

Paper Structure

This paper contains 13 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Abstract structure of our proposed procedural knowledge integration interface. Features of $X$ are extracted in model $M$, resulting in intermediate feature maps (shown in grey). A selected feature map $z$ is then decoded into the input space of the integrated procedural module $f$ through $M_{f_\mathrm{in}}$, and the output of $f$ is encoded back into the latent space of $z$ using $M_{f_\mathrm{out}}$. $M$ continues with this updated latent state to obtain prediction $\hat{y}$. (See the first sketch after this list.)
  • Figure 2: Overview of the prediction of latent state $z$ at time step $t$, given the previous latent states $\textbf{z}$ of time steps $t-N\ldots t-1$, where $N$ is the number of context frames. The encoder model $S_{\mathrm{enc}}$ transforms the fixed latent space of $\textbf{z}$ into a separable latent space composed of dynamics state $\textbf{z}_d$ and Gestalt state $\textbf{z}_g$. Both are fed through the joint Gestalt and dynamics prediction model $G$ to obtain dynamics correction $z_{d_{\mathrm{cor}}}$ and future Gestalt state $z_g$, whereas only $\textbf{z}_d$ is given to the explicit dynamics model $D$ to get the explicit dynamics prediction $z_{d_{\mathrm{exp}}}$. Both $z_{d_{\mathrm{exp}}}$ and $z_{d_{\mathrm{cor}}}$ are fused with fusion method $F$, resulting in the future dynamics state $z_d$. Finally, $z_d$ and $z_g$ are concatenated and fed through the state decoder $S_{\mathrm{dec}}$ to obtain future latent state $z$. In terms of Fig. 1, the integrated physics engine within $D$ corresponds to $f$, with $M_{f_{\mathrm{in}}}$ being the computational graph starting from $S_{\mathrm{enc}}$ up until the physics engine, and $M_{f_{\mathrm{out}}}$ the subsequent dynamics computations until after $S_{\mathrm{dec}}$. The dynamics correction and Gestalt computations are not shown explicitly and are part of $M$. (See the second sketch after this list.)
  • Figure 3: The mIoU performance w.r.t. each auto-regressive frame prediction. While the data-driven model becomes exponentially less accurate over time, the integration of dynamics knowledge keeps the prediction performance stable. The pure variant of our architecture, without a data-driven Gestalt and dynamics predictor, follows the slope of our main architecture, albeit at a lower level. The gap between them indicates the missing handling of Gestalt and dynamics interdependencies.
  • Figure 4: Sample prediction comparisons for different unroll steps. While both models keep object shapes intact, the dynamics of the SlotFormer model diverge quickly, whereas our model keeps up with the complex dynamics.
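
To make the Fig. 1 interface concrete, the following is a minimal PyTorch sketch. The class name, the dimension arguments, and the use of plain linear layers for $M_{f_\mathrm{in}}$ and $M_{f_\mathrm{out}}$ are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ProceduralIntegration(nn.Module):
    """Hypothetical sketch of the Fig. 1 interface: a latent feature map z is
    decoded into the input space of a procedural module f, f is applied, and
    its output is encoded back into the latent space of z."""

    def __init__(self, latent_dim: int, proc_dim: int, f):
        super().__init__()
        self.f = f  # integrated procedural module, e.g. one physics-engine step
        # M_{f_in}: latent space of z -> input space of f (linear map assumed)
        self.m_f_in = nn.Linear(latent_dim, proc_dim)
        # M_{f_out}: output space of f -> latent space of z (linear map assumed)
        self.m_f_out = nn.Linear(proc_dim, latent_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        proc_in = self.m_f_in(z)       # decode z into f's input space
        proc_out = self.f(proc_in)     # apply the procedural knowledge
        return self.m_f_out(proc_out)  # encode the result back into latent space
```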
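
Similarly, a sketch of the Fig. 2 prediction step, assuming $S_{\mathrm{enc}}$, $S_{\mathrm{dec}}$, $G$, $D$, and $F$ are supplied as callables; their internals and tensor shapes are left open and would follow the paper.

```python
import torch
import torch.nn as nn

class LatentStatePredictor(nn.Module):
    """Hypothetical sketch of the Fig. 2 pipeline: the context latent states
    z_{t-N..t-1} are split into dynamics and Gestalt parts, predicted jointly
    by G and explicitly by D, fused by F, and decoded back into latent space."""

    def __init__(self, s_enc, s_dec, g, d, fuse):
        super().__init__()
        self.s_enc = s_enc  # S_enc: fixed latent space -> (z_d, z_g)
        self.g = g          # joint Gestalt and dynamics prediction model G
        self.d = d          # explicit dynamics model D (wraps the physics engine f)
        self.fuse = fuse    # fusion method F
        self.s_dec = s_dec  # S_dec: concatenated (z_d, z_g) -> latent state z

    def forward(self, z_context: torch.Tensor) -> torch.Tensor:
        # Separate the context states into dynamics and Gestalt components.
        z_d_ctx, z_g_ctx = self.s_enc(z_context)
        # G predicts a dynamics correction and the future Gestalt state.
        z_d_cor, z_g = self.g(z_d_ctx, z_g_ctx)
        # D predicts the explicit dynamics from the dynamics states alone.
        z_d_exp = self.d(z_d_ctx)
        # F fuses the explicit prediction with the correction.
        z_d = self.fuse(z_d_exp, z_d_cor)
        # Decode the concatenated future state back into the latent space.
        return self.s_dec(torch.cat([z_d, z_g], dim=-1))
```

In this reading, $D$ plays the role of the Fig. 1 integration interface, while $G$ supplies the data-driven correction for interdependencies that the explicit physics engine alone does not capture.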