Continuous Video Process: Modeling Videos as Continuous Multi-Dimensional Processes for Video Prediction

Gaurav Shrivastava; Abhinav Shrivastava

TL;DR

This work tackles video prediction by reframing video as a continuous multi-dimensional process rather than a sequence of discrete frames. The Continuous Video Process (CVP) defines an explicit forward path between consecutive frames, $\mathbf{x}_t = (1-t)\mathbf{x} + t\mathbf{y} - \frac{t\log(t)}{\sqrt{2}}\mathbf{z}$, and learns a reverse Gaussian process via a variational bound, which requires $75\%$ fewer sampling steps at inference. Experiments on KTH, BAIR, Human3.6M, and UCF101 show state-of-the-art video prediction without external temporal constraints such as temporal attention. The approach uses a U-Net backbone, a specialized noise schedule $g(t) = -t\log(t)$, and a simple training loss, enabling efficient, high-fidelity frame synthesis with competitive scores on metrics such as FVD. The work also provides ablations on noise schedules and sampling steps, and discusses limitations and the broader societal impact of advanced video synthesis.
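The forward path quoted above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `noise_scale` and `forward_sample` are hypothetical, frames are represented as flat lists of floats purely for illustration, and $\mathbf{z}$ is assumed to be standard Gaussian noise. The sketch only evaluates the formula as written, with endpoint $\mathbf{x}$ at $t=0$ and endpoint $\mathbf{y}$ at $t=1$.

```python
import math
import random


def noise_scale(t: float) -> float:
    """Noise schedule g(t) = -t * log(t) on [0, 1] (hypothetical helper name)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if t == 0.0:
        return 0.0  # limit of -t*log(t) as t -> 0+ is 0
    return -t * math.log(t)  # equals 0 at t = 1 because log(1) = 0


def forward_sample(x, y, t, rng=random):
    """Evaluate x_t = (1 - t) * x + t * y - t*log(t)/sqrt(2) * z elementwise.

    x, y: equal-length lists of floats (an illustrative stand-in for frames).
    z is drawn per element from N(0, 1); this is an assumption of the sketch.
    """
    # -t*log(t)/sqrt(2) is the same as g(t)/sqrt(2), so the noise term is added.
    sigma = noise_scale(t) / math.sqrt(2.0)
    return [
        (1.0 - t) * xi + t * yi + sigma * rng.gauss(0.0, 1.0)
        for xi, yi in zip(x, y)
    ]
```

Because $g(0) = g(1) = 0$, the noise term vanishes at both endpoints, so the sample equals $\mathbf{x}$ at $t=0$ and $\mathbf{y}$ at $t=1$. The schedule peaks at $t = 1/e$, where $g(t) = 1/e$.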

Abstract

Diffusion models have made significant strides in image generation, mastering tasks such as unconditional image synthesis, text-image translation, and image-to-image conversion. However, they fall short in video prediction, mainly because they treat videos as collections of independent images and rely on external constraints such as temporal attention mechanisms to enforce temporal coherence. In this paper, we introduce a novel model class that treats video as a continuous multi-dimensional process rather than a series of discrete frames. We also report a 75\% reduction in the number of sampling steps required to sample a new frame, making our framework more efficient at inference time. Through extensive experimentation, we establish state-of-the-art performance in video prediction, validated on benchmark datasets including KTH, BAIR, Human3.6M, and UCF101. Video results are available on the project page: https://www.cs.umd.edu/~gauravsh/cvp/supp/website.html
