Improved Conditional VRNNs for Video Prediction
Lluis Castrejon, Nicolas Ballas, Aaron Courville
TL;DR
This work tackles video prediction by enhancing variational latent-variable models with a hierarchical variational recurrent neural network (VRNN) that uses dense latent connectivity and a high-capacity likelihood decoder. By increasing both the expressiveness of the latent distributions and the capacity of the likelihood model, the approach mitigates blurriness and better captures multi-modal future dynamics. Across BAIR Push, Cityscapes, and Stochastic Moving MNIST, the method achieves strong results, including significant FVD and LPIPS improvements over SVG-LP baselines and performance competitive with SAVP, with ablations confirming the value of deeper likelihood models and latent hierarchies. The findings suggest that current VRNN-based models underfit and that larger, more flexible generative models can substantially improve video forecasting quality and realism.
Abstract
Predicting future frames for a video sequence is a challenging generative modeling task. Promising approaches include probabilistic latent variable models such as the Variational Auto-Encoder (VAE). While VAEs can handle uncertainty and model multiple possible future outcomes, they have a tendency to produce blurry predictions. In this work, we argue that this is a sign of underfitting. To address this issue, we propose to increase the expressiveness of the latent distributions and to use higher-capacity likelihood models. Our approach relies on a hierarchy of latent variables, which defines a family of flexible prior and posterior distributions in order to better model the probability of future sequences. We validate our proposal through a series of ablation experiments and compare our approach to current state-of-the-art latent variable models. Our method performs favorably under several metrics on three different datasets.
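
As an illustration of how a hierarchy of latent variables enters the training objective, the following is a sketch of a VRNN-style evidence lower bound with L latent levels per time step. The notation is an assumption made here and is not taken from the abstract: x_t is the frame at step t, z_t^l is the latent variable at level l of step t, q is the approximate posterior, and p is the learned prior and likelihood. Conditioning on context frames is omitted for brevity, and the exact conditioning structure used in the paper may differ.

```latex
\log p(x_{1:T}) \;\ge\; \sum_{t=1}^{T} \mathbb{E}_{q(z_{\le T} \mid x_{\le T})}\Big[
    \log p\big(x_t \mid z_{\le t}, x_{<t}\big)
    \;-\; \sum_{l=1}^{L} D_{\mathrm{KL}}\Big(
        q\big(z_t^{l} \mid z_t^{<l}, z_{<t}, x_{\le t}\big)
        \,\Big\|\,
        p\big(z_t^{l} \mid z_t^{<l}, z_{<t}, x_{<t}\big)
    \Big)
\Big]
```

With L = 1 this reduces to the usual single-level VRNN bound. The sum over levels follows from the chain rule for the KL divergence when both the prior and the posterior factorize autoregressively over levels. The first term is where the likelihood model's capacity matters, and the second is where a more expressive prior and posterior can tighten the bound.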
