Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher-Shlizerman, Aleksander Holynski, Steven M. Seitz

TL;DR

The paper tackles keyframe interpolation by adapting a pretrained image-to-video diffusion model (Stable Video Diffusion) to generate coherent in-between frames. It introduces a lightweight backward-motion fine-tuning that leverages 180-degree rotated temporal self-attention maps and a dual-directional sampling strategy that fuses forward and backward motion paths to ensure motion consistency. Experiments on DAVIS and Pexels demonstrate superior quality over state-of-the-art frame interpolation baselines and related diffusion methods, particularly for distant keyframes. The approach achieves high-resolution outputs with coherent dynamics while requiring minimal additional training data and only a small fraction of model parameters to be updated. Overall, it offers a practical, resource-efficient solution for high-quality keyframe inbetweening using existing large-scale video diffusion priors.

Abstract

We present a method for generating video sequences with coherent motion between a pair of input key frames. We adapt a pretrained large-scale image-to-video diffusion model (originally trained to generate videos moving forward in time from a single input image) for key frame interpolation, i.e., to produce a video in between two input frames. We accomplish this adaptation through a lightweight fine-tuning technique that produces a version of the model that instead predicts videos moving backwards in time from a single input image. This model (along with the original forward-moving model) is subsequently used in a dual-directional diffusion sampling process that combines the overlapping model estimates starting from each of the two keyframes. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.
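
The dual-directional sampling idea in the abstract can be illustrated with a minimal sketch of a single fusion step. Everything here is a toy stand-in: `fuse_step`, `toy_forward`, and `toy_backward` are hypothetical names, each latent frame is reduced to one scalar, and the equal-weight average is an assumption about the fusion rule rather than something the text above states.

```python
from typing import Callable, List

Frames = List[float]  # toy stand-in: one scalar per latent frame


def fuse_step(
    z_t: Frames,
    predict_forward: Callable[[Frames], Frames],
    predict_backward: Callable[[Frames], Frames],
) -> Frames:
    """Combine forward and backward predictions for one denoising step."""
    # The forward model sees the frames in their natural order.
    v_fwd = predict_forward(z_t)
    # The backward model sees the time-reversed frames ...
    v_bwd = predict_backward(z_t[::-1])
    # ... so its output is reversed again to realign the frame indices.
    v_bwd_aligned = v_bwd[::-1]
    # ASSUMPTION: equal-weight average as the fusion rule.
    return [0.5 * (a + b) for a, b in zip(v_fwd, v_bwd_aligned)]


# Hypothetical toy predictors (not real models): each adds a frame's
# position in the sequence it was handed.
def toy_forward(z: Frames) -> Frames:
    return [x + i for i, x in enumerate(z)]


def toy_backward(z: Frames) -> Frames:
    return [x + i for i, x in enumerate(z)]


fused = fuse_step([0.0, 0.0, 0.0, 0.0], toy_forward, toy_backward)
print(fused)
```

In a real sampler the two callables would be the original and the fine-tuned network, each conditioned on its own keyframe, and the fused prediction would drive the usual denoising update.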

Paper Structure (21 sections, 3 equations, 9 figures, 2 tables, 2 algorithms)

Figures (9)

  • Figure 1: Method overview. In the lightweight backward motion fine-tuning stage, an input video $\mathbf{x}=\{I_0, I_1, ..., I_{N-1}\}$ is encoded into the latent space by $\mathcal{E}(\mathbf{x})$, and noise is added to create the noisy latent $\mathbf{z}_{t}$; during inference, $\mathbf{z}_{t}$ is created by iterative denoising starting from $\mathbf{z}_T\sim\mathcal{N}(\mathbf{0}, \mathbf{I})$. (1) Forward motion prediction: we first take the conditioning $\mathbf{c}_0$ of the first input image (inference stage) or the first frame in the video (training stage) $I_0$, along with the noisy latent $\mathbf{z}_t$, and feed them into the pre-trained 3D U-Net $f_{\theta}$ to get the noise predictions $\hat{\mathbf{v}}_{t, 0}$, as well as the temporal self-attention maps $\{A_i\}$. (2) Backward motion prediction: we reverse the noisy latent $\mathbf{z}_{t}$ along the temporal axis to get $\mathbf{z}'_{t}$. Then we take the conditioning $\mathbf{c}_{N-1}$ of the second input image, or the last frame in the video $I_{N-1}$, along with the 180-degree rotated temporal self-attention maps $\{A'_i\}$, and feed them through the fine-tuned 3D U-Net $f_{\theta'}$ for backward motion prediction $\hat{\mathbf{v}}_{t, 1}$. (3) Fuse and update: the predicted backward motion noise is reversed again and fused with the forward motion noise to create a consistent motion path. Note that only the value and output projection matrices $W_{\{v, o\}}$ in the temporal self-attention layers (green) are fine-tuned; see Fig. 2 for more details.
  • Figure 2: Temporal self-attention module in the backward motion generation. Given input tensor $X$, our attention mechanism additionally takes the respective attention map $A$ from the pre-trained SVD featuring forward motion, rotating it by 180 degrees to create a reverse motion-time association $A'$. Note that $W_{\{v, o\}}$ are the only trainable parameters in this module.
  • Figure 3: Qualitative baseline comparisons. Leftmost ($i=0$) and rightmost columns ($i=24$): start and end frames. TRF generates back-and-forth motions, such as vehicles moving forward and then reversing. FILM struggles to find correspondences when the input frames are distant and morphs from the first frame to the last. The red arrow indicates the direction of motion. We recommend viewing the supplementary videos.
  • Figure 4: Ablation study. We evaluate other options for generating in-between motion consistency. (1) Ours w/o RA: full pipeline with fine-tuning all parameters $W_{\{q,k,v,o\}}$ in the temporal attention layers but without using 180-degree rotated temporal self-attention maps as extra input (top row). (2) Ours w/o FT: full pipeline without fine-tuning $W_{\{v, o\}}$ for backward motion (second row). The differences are highlighted in the red rectangle.
  • Figure 5: Our method outperforms FILM and TRF in generating articulated movements in between, but still struggles to create natural kinematic motions because SVD itself fails to generate complex kinematics (bottom row). Note that the input image serves as conditioning to SVD, so the generated first frame might differ from the input image if SVD struggles to create plausible videos from that input.
  • ...and 4 more figures
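
The 180-degree rotation mentioned in the Figure 1 and Figure 2 captions can be checked on a toy example. The sketch below assumes a single attention map stored as a plain square list of lists, with row $i$ holding the weights that frame $i$ places on every frame; `rotate_180` and `apply_attention` are hypothetical helper names, and the learned projections $W_{\{v, o\}}$ are left out entirely. Rotating the map by 180 degrees amounts to reversing both the row order and the column order, and applying the rotated map to time-reversed values gives the time-reversed output of the original map.

```python
from typing import List

Matrix = List[List[float]]


def rotate_180(A: Matrix) -> Matrix:
    """Rotate a square map by 180 degrees: reverse row order and column order."""
    return [row[::-1] for row in A[::-1]]


def apply_attention(A: Matrix, values: List[float]) -> List[float]:
    """Weighted sum of per-frame values; one output per query frame (row)."""
    return [sum(w * v for w, v in zip(row, values)) for row in A]


# Toy 3-frame attention map (each row sums to 1) and one scalar per frame.
A = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.25, 0.75],
]
values = [1.0, 2.0, 4.0]

forward_out = apply_attention(A, values)
backward_out = apply_attention(rotate_180(A), values[::-1])

print(forward_out)
print(backward_out)
```

The identity holds because entry $(i, j)$ of the rotated map equals entry $(N-1-i, N-1-j)$ of the original, so each frame's association pattern is carried over to its mirror position in time.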