Video Interpolation with Diffusion Models

Siddhant Jain, Daniel Watson, Eric Tabellion, Aleksander Hołyński, Ben Poole, Janne Kontkanen

TL;DR

This work introduces VIDIM, a cascaded diffusion framework for video interpolation conditioned on start and end frames. It combines a base diffusion model operating on 64×64 frames to generate intermediate content with a subsequent super-resolution diffusion model that upscales to 256×256, both conditioned without increasing parameter count via a parameter-free conditioning scheme and enhanced by classifier-free guidance on conditioning frames. The approach achieves state-of-the-art performance on challenging, large-motion interpolation tasks, validated by quantitative metrics (FVD, FID, LPIPS) and a human study, and demonstrates scalable training with memory-efficient diffusion components and model upscaling. The results suggest a broadly useful paradigm for conditional video generation, with potential extensions to video restoration, extrapolation, and other conditioning-enabled generation tasks.

Abstract

We present VIDIM, a generative model for video interpolation, which creates short videos given a start and end frame. In order to achieve high fidelity and generate motions unseen in the input data, VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution generated video. We compare VIDIM to previous state-of-the-art methods on video interpolation, and demonstrate how such works fail in most settings where the underlying motion is complex, nonlinear, or ambiguous while VIDIM can easily handle such cases. We additionally demonstrate how classifier-free guidance on the start and end frame and conditioning the super-resolution model on the original high-resolution frames without additional parameters unlocks high-fidelity results. VIDIM is fast to sample from as it jointly denoises all the frames to be generated, requires less than a billion parameters per diffusion model to produce compelling results, and still enjoys scalability and improved quality at larger parameter counts.
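
The abstract mentions classifier-free guidance on the start and end frames without spelling out the arithmetic. As a rough illustration only, the sketch below shows the standard guidance blend of a conditional and an unconditional denoiser output. The function name `guided_prediction`, the flat-list representation of a prediction, and the convention that a weight of 1 means "no extra guidance" are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Sequence


def guided_prediction(
    cond: Sequence[float], uncond: Sequence[float], weight: float
) -> List[float]:
    """Blend two denoiser outputs with classifier-free guidance.

    `cond` is the prediction made with the conditioning frames present,
    `uncond` the prediction made with them dropped. Under the convention
    assumed here, weight == 1.0 returns `cond` unchanged and larger
    weights push the result further away from `uncond`.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


# Toy values standing in for per-pixel predictions (hypothetical data).
cond = [0.5, -1.0, 2.0]
uncond = [0.0, -0.5, 1.0]

print(guided_prediction(cond, uncond, 1.0))  # plain conditional prediction
print(guided_prediction(cond, uncond, 2.0))  # extrapolated past the conditional
```

In a real sampler this blend would be applied to full video tensors at every denoising step; the list version here only makes the per-element formula easy to check by hand.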

Paper Structure

This paper contains 16 sections, 3 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Frame interpolation for very large and ambiguous motion. The middle frame of an interpolated video with FILM [reda2022film], RIFE [rife], LDMVFI [danier2023ldmvfi], and AMT [amt] shows large blurry artifacts. VIDIM, however, is able to recover a plausible output frame. Note that due to the ambiguity of the problem, VIDIM's output is not always similar to the ground truth (especially clear in the top example), but corresponds to a different choice of motion. See https://vidim-interpolation.github.io/ for video outputs.
  • Figure 2: Two examples from the DAVIS-9 dataset, showing the predicted in-between frames. Top: The break-dancer example demonstrates highly ambiguous motion. Our method can produce plausible video with sharp details, whereas the baselines [reda2022film, rife, amt], trained with a regression objective, resort to predicting blurry images. Bottom: On a very large motion with significant perspective change on the dirt bike, the baselines fail to reconstruct sharp results, whereas our method produces sharp results with plausible motion.
  • Figure 3: Human evaluation results on DAVIS-7, showing how often VIDIM and each baseline were preferred by human raters.
  • Figure 4: Sample comparison between our VIDIM medium super-resolution model (top) and an identically trained baseline without high-resolution frame conditioning.
  • Figure 5: FID scores comparison between VIDIM and an inpainting baseline model at different guidance and reconstruction guidance weights, respectively. Note that the reconstruction guidance weights (x-axis) for the baseline are re-scaled via $f(w)=(w-1)/13 + 1$ to more easily compare scores at the optimal region to VIDIM; the true range for the baseline guidance weights is from 1 to 27. The baseline model achieves an FID score of 60.11.
  • ...and 4 more figures
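
The Figure 5 caption rescales the baseline's reconstruction guidance weights with $f(w)=(w-1)/13 + 1$. A small sketch of that mapping, using the formula and the stated true range of 1 to 27 from the caption; the function name `rescale_weight` is hypothetical:

```python
def rescale_weight(w: float) -> float:
    """Apply the caption's rescaling f(w) = (w - 1) / 13 + 1."""
    return (w - 1.0) / 13.0 + 1.0


# Endpoints and midpoint of the baseline's true weight range (1 to 27).
for w in (1.0, 14.0, 27.0):
    print(w, "->", rescale_weight(w))
```

Under this formula, the true range of 1 to 27 is compressed onto 1 to 3 on the plotted x-axis, with a weight of 1 left unchanged.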