Table of Contents
Fetching ...

Video prediction using score-based conditional density estimation

Pierre-Étienne H. Fiquet, Eero P. Simoncelli

TL;DR

An implicit regression-based framework for learning and sampling the conditional density of the next frame in a video given previous observed frames is described and it is shown that sequence-to-image deep networks trained on a simple resilience-to-noise objective function extract adaptive representations for temporal prediction.

Abstract

Temporal prediction is inherently uncertain, but representing the ambiguity in natural image sequences is a challenging high-dimensional probabilistic inference problem. For natural scenes, the curse of dimensionality renders explicit density estimation statistically and computationally intractable. Here, we describe an implicit regression-based framework for learning and sampling the conditional density of the next frame in a video given previous observed frames. We show that sequence-to-image deep networks trained on a simple resilience-to-noise objective function extract adaptive representations for temporal prediction. Synthetic experiments demonstrate that this score-based framework can handle occlusion boundaries: unlike classical methods that average over bifurcating temporal trajectories, it chooses among likely trajectories, selecting more probable options with higher frequency. Furthermore, analysis of networks trained on natural image sequences reveals that the representation automatically weights predictive evidence by its reliability, which is a hallmark of statistical inference

Video prediction using score-based conditional density estimation

TL;DR

An implicit regression-based framework for learning and sampling the conditional density of the next frame in a video given previous observed frames is described and it is shown that sequence-to-image deep networks trained on a simple resilience-to-noise objective function extract adaptive representations for temporal prediction.

Abstract

Temporal prediction is inherently uncertain, but representing the ambiguity in natural image sequences is a challenging high-dimensional probabilistic inference problem. For natural scenes, the curse of dimensionality renders explicit density estimation statistically and computationally intractable. Here, we describe an implicit regression-based framework for learning and sampling the conditional density of the next frame in a video given previous observed frames. We show that sequence-to-image deep networks trained on a simple resilience-to-noise objective function extract adaptive representations for temporal prediction. Synthetic experiments demonstrate that this score-based framework can handle occlusion boundaries: unlike classical methods that average over bifurcating temporal trajectories, it chooses among likely trajectories, selecting more probable options with higher frequency. Furthermore, analysis of networks trained on natural image sequences reveals that the representation automatically weights predictive evidence by its reliability, which is a hallmark of statistical inference

Paper Structure

This paper contains 31 sections, 17 equations, 12 figures.

Figures (12)

  • Figure 1: Modeling frameworks for probabilistic prediction. Both the classical and the proposed approach are trained unsupervised on image sequences and learn the distribution of the next frame conditioned on the recent past. Both approaches make predictions by sampling from this learned conditional density, but the density can be either explicit or implicit. Left. Traditional framework: the model is explicit and its parameters are learned through repeated iterations of inference and sampling. The learning objective is applied end-to-end: from past frames, to inferred latent representation, to generated next frame. Right. Proposed score-based framework: train a denoiser via regression, i.e., a mapping from past frames and noisy observation to an estimate of the clean next frame. The trained denoiser approximates the score function and implicitly represents the probabilistic model (denoted by a funnel). To sample from this implicit model, iterative partial denoising gradually transforms noise into a predicted next frame.
  • Figure 2: Moving leaves dataset. Example image sequences from our synthetic dataset. Each sequence contains two disks moving along smooth curves and occluding each other. The disks move against a blank background and the larger disk always occludes the smaller one.
  • Figure 3: Samples around ambiguous occlusion boundary. Predicting an ambiguous disk occlusion. Top. Two conditioning frames contain disks of equal size moving towards each other. The network observes these conditioning frames and pure noise in place of next frame (all input frames are highlighted in red). The network combines these observations to produce an estimated next frame (highlighted in green). Middle. Intermediate steps of the iterative denoising procedure, this score-based sampling algorithm uses the same conditional denoiser network as above but takes partial denoising steps. The corresponding sampled next-frame is highlighted in blue. Bottom. Example samples of probable next-frame generated using the iterative partial denoising procedure starting from different random initializations.
  • Figure 4: Decisions preceding occlusion events. Frequency with which sampled next-frames predict right disk occlusion, as a function of the difference in radius (which indicates depth) between the two disks (averaged over 64 samples). Highlighted red points correspond to the unambiguous examples from Figure \ref{['fig:occlusion-samples']} and the ambiguous example from Figure \ref{['fig:ambiguous-samples']}. The orange curve is a logistic function fit to the choice data, it describes the network's sensitivity to the disk's relative size difference.
  • Figure 5: Recursive generation of moving disk sequences. Coherent sequences can be generated via recursive sampling, but not via recursive estimation. In all three examples the first two frames come from the test dataset and the successive frames are generated recursively, using the previous two frames as conditioning. Top. Frames are generated by direct application of the denoiser. For the first step of this recursive process, the inputs to the conditional denoiser are highlighted in red (the noise observation, $y_{t+1}$, is not shown) and the estimated next frame is highlighted in green. Middle. Frames are generated by sampling (using iterative partial denoising). The generated next frame (highlighted in blue) is sampled from the conditional distribution of probable next frame. Bottom. Another example of recursive sampling. The fourth step in the recursive sampling process is highlighted and shows that after full occlusion of the small disk, its size, color and location are often altered.
  • ...and 7 more figures