Improved Conditional VRNNs for Video Prediction

Lluis Castrejon, Nicolas Ballas, Aaron Courville

TL;DR

This work tackles video prediction by enhancing variational latent-variable models with a hierarchical VRNN that uses dense latent connectivity and a high-capacity likelihood decoder. By increasing both the expressiveness of latent distributions and the capacity of the likelihood model, the approach mitigates blurriness and better captures multi-modal future dynamics. Across BAIR Push, Cityscapes, and Stochastic Moving MNIST, the method achieves strong results, including significant FVD and LPIPS improvements over SVG-LP baselines and competitive performance with SAVP, with ablations confirming the value of deeper likelihoods and latent hierarchies. The findings suggest that current VRNN-based models underfit and that larger, more flexible generative models can substantially improve video forecasting quality and realism.

Abstract

Predicting future frames for a video sequence is a challenging generative modeling task. Promising approaches include probabilistic latent variable models such as the Variational Auto-Encoder. While VAEs can handle uncertainty and model multiple possible future outcomes, they have a tendency to produce blurry predictions. In this work we argue that this is a sign of underfitting. To address this issue, we propose to increase the expressiveness of the latent distributions and to use higher capacity likelihood models. Our approach relies on a hierarchy of latent variables, which defines a family of flexible prior and posterior distributions in order to better model the probability of future sequences. We validate our proposal through a series of ablation experiments and compare our approach to current state-of-the-art latent variable models. Our method performs favorably under several metrics in three different datasets.

Paper Structure

This paper contains 16 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Can generative models predict the future? We propose an improved VAE model for video prediction. Our model uses hierarchical latents and a higher capacity likelihood network to improve upon previous VAE approaches, generating more visually appealing samples that remain coherent for longer temporal horizons.
  • Figure 2: Graphical model for the learned prior with the dense latent connectivity pattern. Arrows in red show the connections from the input at the previous timestep to current latent variables. Arrows in green highlight skip connections between latent variables and connections to outputs. Arrows in black indicate recurrent temporal connections. We empirically observe that this dense-connectivity pattern eases the training of latent hierarchies.
  • Figure 3: Model Parametrization. Our model uses a CNN to encode frames individually. The representation of the context frames is used to initialize the states of the prior, posterior and likelihood networks, all of which use recurrent networks. At each timestep, the decoder receives an encoding of the previous frame, a set of latent variables (either from the prior or the posterior) and its previous hidden state and predicts the next frame in the sequence.
  • Figure 4: Average normalized KL per latent channel. We visualize the mean normalized KL for each latent channel for models from Table \ref{tab:latent_hierarchy}. Without beta warmup and dense connectivity the hierarchy of latents is underutilized, with most information being encoded in a few latents of the top level. In contrast, the same model with these techniques utilizes all latent levels.
  • Figure 5: Selected Samples for BAIR Push and Cityscapes. We show a sequence for BAIR Push and Cityscapes and random generations from our model and baselines. On BAIR Push we observe that the SAVP predictions are crisp but sometimes depict inconsistent arm-object interactions. SVG-LP produces blurry predictions in uncertain areas such as occluded parts of the background or those showing object interactions. Our model generates plausible interactions with reduced blurriness relative to SVG-LP. On Cityscapes, the SVG-LP baseline is unable to model any motion. Our model, using a hierarchy of latents, generates more visually compelling predictions. More samples can be found in the Appendix.
  • ...and 2 more figures
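
The Figure 4 caption refers to two quantities that are easy to make concrete: a beta warmup on the KL term and a per-channel KL used to check whether latents are being utilized. The following is a minimal sketch and not the authors' implementation. It assumes a linear warmup schedule, diagonal Gaussian prior and posterior distributions, and a normalization that divides by the largest per-channel value; none of these details are given in the text above. All function names are hypothetical.

```python
import math


def beta_warmup(step, warmup_steps, beta_max=1.0):
    """Linearly increase the KL weight from 0 to beta_max (assumed schedule)."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two univariate Gaussians given means and log-variances."""
    return 0.5 * (
        logvar_p - logvar_q
        + (math.exp(logvar_q) + (mu_q - mu_p) ** 2) / math.exp(logvar_p)
        - 1.0
    )


def mean_kl_per_channel(posterior, prior):
    """Average the KL over samples, separately for each latent channel.

    posterior, prior: one entry per sample, each a list of (mu, logvar)
    pairs with one pair per latent channel.
    """
    num_samples = len(posterior)
    totals = [0.0] * len(posterior[0])
    for q_sample, p_sample in zip(posterior, prior):
        for c, ((mu_q, lv_q), (mu_p, lv_p)) in enumerate(zip(q_sample, p_sample)):
            totals[c] += gaussian_kl(mu_q, lv_q, mu_p, lv_p)
    return [total / num_samples for total in totals]


def normalize(values):
    """Scale values so the largest equals 1 (assumed normalization)."""
    peak = max(values)
    return [v / peak if peak > 0 else 0.0 for v in values]


if __name__ == "__main__":
    # Two latent channels, one sample: only the first channel carries information.
    posterior = [[(1.0, 0.0), (0.0, 0.0)]]
    prior = [[(0.0, 0.0), (0.0, 0.0)]]
    kl = mean_kl_per_channel(posterior, prior)
    print("per-channel KL:", kl)
    print("normalized:", normalize(kl))
    print("beta at step 50 of 100:", beta_warmup(50, 100))
```

A channel whose KL stays near zero has a posterior that matches the prior, so it encodes essentially no information about the frame; this is the sense in which a latent hierarchy can be underutilized.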