Video Interpolation with Diffusion Models
Siddhant Jain, Daniel Watson, Eric Tabellion, Aleksander Hołyński, Ben Poole, Janne Kontkanen
TL;DR
This work introduces VIDIM, a cascaded diffusion framework for video interpolation conditioned on start and end frames. It combines a base diffusion model operating on 64×64 frames to generate intermediate content with a subsequent super-resolution diffusion model that upscales to 256×256, both conditioned without increasing parameter count via a parameter-free conditioning scheme and enhanced by classifier-free guidance on conditioning frames. The approach achieves state-of-the-art performance on challenging, large-motion interpolation tasks, validated by quantitative metrics (FVD, FID, LPIPS) and a human study, and demonstrates scalable training with memory-efficient diffusion components and model upscaling. The results suggest a broadly useful paradigm for conditional video generation, with potential extensions to video restoration, extrapolation, and other conditioning-enabled generation tasks.
Abstract
We present VIDIM, a generative model for video interpolation, which creates short videos given a start and end frame. In order to achieve high fidelity and generate motions unseen in the input data, VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution generated video. We compare VIDIM to previous state-of-the-art methods on video interpolation, and demonstrate how such works fail in most settings where the underlying motion is complex, nonlinear, or ambiguous while VIDIM can easily handle such cases. We additionally demonstrate how classifier-free guidance on the start and end frame and conditioning the super-resolution model on the original high-resolution frames without additional parameters unlocks high-fidelity results. VIDIM is fast to sample from as it jointly denoises all the frames to be generated, requires less than a billion parameters per diffusion model to produce compelling results, and still enjoys scalability and improved quality at larger parameter counts.
