ViPro: Enabling and Controlling Video Prediction for Complex Dynamical Scenarios using Procedural Knowledge
Patrick Takenaka, Johannes Maucher, Marco F. Huber
TL;DR
ViPro tackles video prediction in complex dynamical scenes by integrating procedural domain knowledge through a dedicated module $P$ that sits alongside a data-driven predictor. The latent state is decomposed into $z_a$, $z_b$, and $z_c$, which separate symbolic dynamics, residual dynamics, and static appearance, respectively. A loss $\mathcal{L} = \mathcal{L}_{rec} + \lambda \mathcal{L}_s$ combines frame reconstruction with a term that enforces symbolic-latent consistency. Evaluations on three Kubric-based datasets (Orbits, Acrobot, and Pendulum Camera) show that ViPro can outperform several baselines on challenging dynamics and supports controllable predictions via its symbolic interface and MPC integration. The results indicate that incorporating procedural knowledge reduces data requirements, improves transparency, and enables downstream control, which motivates dynamic function libraries and neural program synthesis as future work.
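The latent split and the combined loss can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the names `Latent`, `step`, and `total_loss` are hypothetical, both loss terms are assumed to be mean squared errors, $\mathcal{L}_s$ is assumed to compare $z_a$ against a ground-truth symbolic state, and the constant-velocity update is only a toy stand-in for the procedural module $P$.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class Latent:
    """Hypothetical container for the three latent parts named in the TL;DR."""
    z_a: Vector  # symbolic dynamics state, updated by the procedural module P
    z_b: Vector  # residual dynamics, updated by a learned predictor
    z_c: Vector  # static appearance, carried over unchanged


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def step(
    latent: Latent,
    procedural_fn: Callable[[Vector], Vector],
    residual_fn: Callable[[Vector], Vector],
) -> Latent:
    """Advance one time step: P acts on z_a, a residual model on z_b."""
    return Latent(
        z_a=procedural_fn(latent.z_a),
        z_b=residual_fn(latent.z_b),
        z_c=list(latent.z_c),  # static part is copied, not predicted
    )


def total_loss(
    pred_frame: Sequence[float],
    true_frame: Sequence[float],
    z_a: Sequence[float],
    symbolic_target: Sequence[float],
    lam: float = 1.0,
) -> float:
    """L = L_rec + lambda * L_s, with both terms assumed to be MSE."""
    l_rec = mse(pred_frame, true_frame)
    l_s = mse(z_a, symbolic_target)
    return l_rec + lam * l_s


# Toy procedural function (assumption): constant velocity with dt = 0.1,
# acting on a symbolic state [position, velocity].
def constant_velocity(state: Vector) -> Vector:
    position, velocity = state
    return [position + 0.1 * velocity, velocity]


latent = Latent(z_a=[0.0, 1.0], z_b=[0.5], z_c=[0.3])
next_latent = step(latent, constant_velocity, residual_fn=lambda z: list(z))
loss = total_loss([1.0, 2.0], [1.0, 4.0], z_a=[0.0], symbolic_target=[1.0], lam=0.5)
```

In this toy setting only `z_a` is driven by known dynamics, so replacing `constant_velocity` with a different function changes the predicted state without touching the appearance part, which is the kind of symbolic control the TL;DR refers to.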
Abstract
We propose a novel architecture design for video prediction that uses procedural domain knowledge directly as part of the computational graph of data-driven models. Using new, challenging scenarios, we show that state-of-the-art video predictors struggle in complex dynamical settings, and that introducing prior process knowledge makes their learning problem feasible. Our approach learns a symbolically addressable interface between the data-driven components of the model and our dedicated procedural knowledge module, which we utilize in downstream control tasks.
