ViPro: Enabling and Controlling Video Prediction for Complex Dynamical Scenarios using Procedural Knowledge

Patrick Takenaka, Johannes Maucher, Marco F. Huber

TL;DR

ViPro tackles video prediction in complex dynamical scenes by integrating procedural domain knowledge through a dedicated module $P$ that sits alongside a data-driven predictor. The latent state is decomposed into $z_a$, $z_b$, and $z_c$, which separate symbolic dynamics, residual dynamics, and static appearance, respectively. A loss $\mathcal{L} = \mathcal{L}_{rec} + \lambda \mathcal{L}_s$ enforces symbolic-latent consistency. Evaluations on three Kubric-based datasets (Orbits, Acrobot, and Pendulum Camera) show that ViPro can outperform several baselines on challenging dynamics and supports controllable predictions via symbolic interfaces and MPC integration. The results indicate that incorporating procedural knowledge reduces data requirements, improves transparency, and enables downstream control, paving the way for dynamic function libraries and neural program synthesis in future work.
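The latent split and the combined loss can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `split_latent` and `total_loss`, the slice sizes, the use of mean squared error for both terms, and the default value of `lam` are all assumptions made for illustration.

```python
import numpy as np


def split_latent(z, dim_a, dim_b):
    """Split a latent vector into (z_a, z_b, z_c) along the last axis.

    Assumption: the three parts are contiguous slices; the summary only
    states that the latent is decomposed into these three components.
    """
    z_a = z[..., :dim_a]                 # symbolic dynamics
    z_b = z[..., dim_a:dim_a + dim_b]    # residual dynamics
    z_c = z[..., dim_a + dim_b:]         # static appearance
    return z_a, z_b, z_c


def total_loss(pred_frames, true_frames, z_a, symbolic_target, lam=1.0):
    """L = L_rec + lam * L_s, with both terms as MSE (an assumption)."""
    l_rec = np.mean((pred_frames - true_frames) ** 2)  # frame reconstruction
    l_s = np.mean((z_a - symbolic_target) ** 2)        # symbolic-latent consistency
    return l_rec + lam * l_s
```

Here `lam` plays the role of $\lambda$: larger values weight agreement between $z_a$ and the symbolic state more heavily relative to frame reconstruction.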

Abstract

We propose a novel architecture design for video prediction in order to utilize procedural domain knowledge directly as part of the computational graph of data-driven models. On the basis of new challenging scenarios we show that state-of-the-art video predictors struggle in complex dynamical settings, and highlight that the introduction of prior process knowledge makes their learning problem feasible. Our approach results in the learning of a symbolically addressable interface between data-driven aspects in the model and our dedicated procedural knowledge module, which we utilize in downstream control tasks.

Paper Structure (31 sections, 5 equations, 21 figures, 5 tables)

Figures (21)

  • Figure 1: Overview of our auto-regressive video prediction process. The first $n$ frames are used as reference and are encoded by the model in order to obtain an initial latent representation of the scene. After this burn-in phase the model has to roll out the next $m$ frames on its own.
  • Figure 2: Left: Structure of our proposed procedural knowledge module $P$. Right: Abstract view of the burn-in phase for the object-centric variant of our architecture.
  • Figure 3: Qualitative performance of different model configurations compared to the ground-truth (GT). Predictions of selected frame iterations are shown from left to right. Our model is able to position objects correctly in future frames, while keeping object shading and overall appearance intact.
  • Figure 4: Frame predictions of different time steps (left-to-right) for the Pendulum Camera dataset, with the ground-truth being in the top row (GT).
  • Figure 5: Frame predictions of our model for different time steps (left-to-right) in the case of no changes to the latent vector (Normal), swapping $z_c$ between the object latent vectors (Swap $\mathbf{z_c}$) before decoding, and swapping both latent vectors $z_b$ and $z_c$ with the same permutation (Swap $\mathbf{z_b, z_c}$). Object appearances are swapped, but the dynamics stay unchanged.
  • ...and 16 more figures
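
The burn-in and rollout procedure described in the Figure 1 caption can be sketched as a short loop. This is only an illustrative outline: `predict_video` and the callables `encode`, `step`, and `decode` are hypothetical placeholders, not the model's actual components.

```python
def predict_video(frames, n, m, encode, step, decode):
    """Burn in on the first n frames, then roll out m frames auto-regressively.

    encode(frame, z) -> z   updates the latent from a reference frame
    step(z)          -> z   advances the latent by one time step
    decode(z)        -> frame
    All three callables are hypothetical stand-ins for the model parts.
    """
    z = None
    for frame in frames[:n]:      # burn-in: the model sees reference frames
        z = encode(frame, z)
    predictions = []
    for _ in range(m):            # rollout: the model only sees its own latent
        z = step(z)
        predictions.append(decode(z))
    return predictions
```

During rollout no ground-truth frames are consumed, so any error in the latent state is carried forward into every later prediction.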