Time-adaptive Video Frame Interpolation based on Residual Diffusion

Victor Fonte Chavez, Claudia Esteves, Jean-Bernard Hayet

TL;DR

The paper tackles video frame interpolation for traditional hand-drawn animation with a time-adaptive diffusion model that explicitly conditions on the interpolation time $\tau$ and re-estimates it during training. Building on a ResShift-inspired diffusion scheme, it samples efficiently in about 10 steps and handles two input frames through a multiple-input residual diffusion framework, while the stochasticity of the diffusion process also yields pixel-wise uncertainty estimates. A three-module deep learning pipeline (Feature Extraction, Softmax Splatting Warping, and U-Net Synthesizer) with edge-aware features and a learned temporal guidance term achieves state-of-the-art results on the ATD-12K animation dataset, outperforming baselines on perceptual metrics. The work also introduces $\tau_{IFD}$ to supervise temporal positioning and shows how the uncertainty estimates could help animators locate frames that need correction.
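
For orientation, the residual-shifting idea that this work builds on can be written down in its original single-input form, as given in the ResShift super-resolution paper. This is only a reminder of that base scheme, not the multiple-input generalization proposed here, and the symbols below are ResShift's rather than this paper's: $\mathbf{x}_0$ is the target image, $\mathbf{y}_0$ the conditioning image, $\mathbf{e}_0$ their residual, $\eta_t$ a monotonically increasing shifting schedule, and $\kappa$ a noise-scale hyperparameter.

$$
\mathbf{e}_0 = \mathbf{y}_0 - \mathbf{x}_0, \qquad
q(\mathbf{x}_t \mid \mathbf{x}_0, \mathbf{y}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\; \mathbf{x}_0 + \eta_t\,\mathbf{e}_0,\; \kappa^2 \eta_t\,\mathbf{I}\right), \qquad t = 1, \dots, T.
$$

Because the chain starts close to the conditioning image instead of pure Gaussian noise, a short schedule can suffice, which is consistent with the roughly 10 sampling steps reported above.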

Abstract

In this work, we propose a new diffusion-based method for video frame interpolation (VFI) in the context of traditional hand-made animation. We introduce three main contributions. First, we explicitly handle the interpolation time in our model and re-estimate it during training, to cope with the particularly large variations observed in the animation domain compared to natural videos. Second, we adapt and generalize to VFI a diffusion scheme called ResShift, recently proposed in the super-resolution community, which allows us to produce our estimates with a very small number of diffusion steps (on the order of 10). Third, we leverage the stochastic nature of the diffusion process to provide a pixel-wise estimate of the uncertainty on the interpolated frame, which could be useful to anticipate where the model may be wrong. We provide extensive comparisons with state-of-the-art models and show that our model outperforms them on animation videos. Our code is available at https://github.com/VicFonch/Multi-Input-Resshift-Diffusion-VFI.
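
The third contribution can be illustrated with a minimal sketch. One common way to turn a stochastic sampler into a pixel-wise uncertainty map is to draw several samples and take the per-pixel standard deviation. The abstract does not state which estimator the authors use, so this is an assumption. The function `sample_interpolated_frame` is a hypothetical stand-in for one diffusion sampling pass (a noisy linear blend, not the paper's model), and `pixelwise_uncertainty`, `num_samples`, and `seed` are illustrative names. Plain Python lists are used to keep the example dependency-free; a real implementation would operate on tensors.

```python
import random
import statistics


def sample_interpolated_frame(frame0, frame1, tau, rng):
    """Hypothetical stand-in for ONE stochastic sampling pass.

    Blends the two frames at time tau and adds Gaussian noise whose scale
    grows where the frames disagree. This is NOT the paper's model.
    """
    sample = []
    for row0, row1 in zip(frame0, frame1):
        out_row = []
        for p0, p1 in zip(row0, row1):
            mean = (1.0 - tau) * p0 + tau * p1
            noise_scale = 0.1 * abs(p1 - p0)  # assumed toy noise model
            out_row.append(mean + rng.gauss(0.0, noise_scale))
        sample.append(out_row)
    return sample


def pixelwise_uncertainty(frame0, frame1, tau, num_samples=16, seed=0):
    """Return (mean_frame, std_frame) computed over repeated samples."""
    rng = random.Random(seed)
    samples = [
        sample_interpolated_frame(frame0, frame1, tau, rng)
        for _ in range(num_samples)
    ]
    height, width = len(frame0), len(frame0[0])
    # Per-pixel mean: the point estimate of the interpolated frame.
    mean_frame = [
        [statistics.fmean(s[y][x] for s in samples) for x in range(width)]
        for y in range(height)
    ]
    # Per-pixel standard deviation: the uncertainty map.
    std_frame = [
        [statistics.pstdev([s[y][x] for s in samples]) for x in range(width)]
        for y in range(height)
    ]
    return mean_frame, std_frame


if __name__ == "__main__":
    # Tiny 2x2 grayscale "frames": left column static, right column changes.
    frame0 = [[0.0, 0.0], [1.0, 0.2]]
    frame1 = [[0.0, 1.0], [1.0, 0.8]]
    mean_frame, std_frame = pixelwise_uncertainty(frame0, frame1, tau=0.5)
    for row in std_frame:
        print(["%.3f" % v for v in row])
```

In this toy setup, pixels that are identical in both inputs get zero spread, while pixels that change between the inputs get a positive one. A map of this kind is what could flag regions where an interpolated frame may need manual correction.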

Paper Structure

This paper contains 23 sections, 35 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Zoomed-in qualitative comparison on a challenging interpolation scenario. From left to right: initial frame $\mathbf I_0$, final frame $\mathbf I_1$, ground truth (cropped), SoftSplat prediction, and our result. Note the quality of the reconstructed fingers for the different methods.
  • Figure 2: Distribution of $\tau$ for photorealistic images (left) and animated images (right). The histograms show a concentration of values around $0.5$ in both categories, with a much higher dispersion in the animated images.
  • Figure 3: General overview of the proposed model.
  • Figure 4: Initial warping. Based on Softmax Splatting [SoftmaxSplating_original], we produce two initial versions of the intermediate images, $\mathbf{\hat{I}}_{0 \rightarrow \tau}$ and $\mathbf{\hat{I}}_{\tau \rightarrow 1}$.
  • Figure 5: Architecture of the proposed U-Net based synthesizer. Note that the diffusion timestep $t$ is passed to all the intermediate levels.
  • ...and 3 more figures