Time-adaptive Video Frame Interpolation based on Residual Diffusion

Victor Fonte Chavez; Claudia Esteves; Jean-Bernard Hayet

TL;DR

The paper tackles video frame interpolation in traditional hand-drawn animation by introducing a time-adaptive diffusion model that explicitly conditions on the interpolation time $\tau$ and re-estimates it during training. Building on a ResShift-inspired diffusion, it enables efficient sampling with about 10 steps and handles two input frames via a multiple-input residual diffusion framework, while also providing pixel-wise uncertainty through the diffusion process. A three-module deep learning pipeline (Feature Extraction, Softmax Splatting Warping, and U-Net Synthesizer) with edge-aware features and a learned temporal guidance term achieves state-of-the-art results on the ATD-12K animation dataset, outperforming baselines on perceptual metrics. The work also introduces $\tau_{IFD}$ to supervise temporal positioning and demonstrates the practical potential of uncertainty estimates for aiding animators in frame corrections.

Abstract

In this work, we propose a new diffusion-based method for video frame interpolation (VFI) in the context of traditional hand-made animation. We introduce three main contributions. First, we explicitly handle the interpolation time in our model, and we re-estimate it during training, to cope with the particularly large variations observed in the animation domain compared to natural videos. Second, we adapt and generalize to VFI a diffusion scheme called ResShift, recently proposed in the super-resolution community, which allows us to produce our estimates with a very low number of diffusion steps (on the order of 10). Third, we leverage the stochastic nature of the diffusion process to provide a pixel-wise estimate of the uncertainty on the interpolated frame, which could be useful to anticipate where the model may be wrong. We provide extensive comparisons with state-of-the-art models and show that our model outperforms them on animation videos. Our code is available at https://github.com/VicFonch/Multi-Input-Resshift-Diffusion-VFI.
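The abstract does not say how the pixel-wise uncertainty is computed. As a minimal sketch of one common approach, assuming that uncertainty is taken as the per-pixel standard deviation over several independent stochastic sampling runs, the idea can be illustrated as follows. The names `pixelwise_uncertainty` and `toy_sampler` are hypothetical, and `toy_sampler` is a stand-in for a real diffusion sampler, not the authors' model.

```python
import numpy as np


def pixelwise_uncertainty(sample_fn, num_samples=8, seed=0):
    """Estimate a mean frame and a per-pixel uncertainty map.

    sample_fn(rng) is assumed to return one stochastic interpolated frame
    of shape (H, W, C). This is an illustrative assumption, not the
    paper's interface.
    """
    rng = np.random.default_rng(seed)
    # Draw several independent samples of the same intermediate frame.
    samples = np.stack([sample_fn(rng) for _ in range(num_samples)], axis=0)
    mean_frame = samples.mean(axis=0)      # (H, W, C)
    std_per_channel = samples.std(axis=0)  # (H, W, C)
    # Average over channels to obtain one uncertainty value per pixel.
    uncertainty_map = std_per_channel.mean(axis=-1)  # (H, W)
    return mean_frame, uncertainty_map


def toy_sampler(rng, height=4, width=4):
    """Hypothetical stand-in for one run of a stochastic sampler.

    The left half of the frame is deterministic; the right half is noisy,
    mimicking a region where samples disagree.
    """
    frame = np.full((height, width, 3), 0.5)
    noise_scale = np.zeros((height, width, 1))
    noise_scale[:, width // 2:, :] = 0.1
    return frame + noise_scale * rng.standard_normal((height, width, 3))


mean_frame, uncertainty_map = pixelwise_uncertainty(toy_sampler, num_samples=16)
```

In this toy setup the uncertainty map is zero where all samples agree and positive where they differ, which is the kind of signal that could point an animator to regions needing manual correction.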
