Time-adaptive Video Frame Interpolation based on Residual Diffusion
Victor Fonte Chavez, Claudia Esteves, Jean-Bernard Hayet
TL;DR
The paper tackles video frame interpolation for traditional hand-drawn animation with a time-adaptive diffusion model that explicitly conditions on the interpolation time $\tau$ and re-estimates it during training. Building on a diffusion scheme inspired by ResShift, the model samples in about 10 steps, handles two input frames through a multiple-input residual diffusion framework, and uses the stochasticity of the diffusion process to provide pixel-wise uncertainty estimates. A three-module pipeline (Feature Extraction, Softmax Splatting Warping, and a U-Net Synthesizer) with edge-aware features and a learned temporal guidance term achieves state-of-the-art results on the ATD-12K animation dataset, outperforming baselines on perceptual metrics. The work also introduces $\tau_{IFD}$ to supervise temporal positioning and shows how the uncertainty estimates could help animators locate frames that need correction.
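To make the residual-shifting idea concrete, the following is a minimal sketch of the forward marginal used by the original single-input ResShift scheme, $x_t = x_0 + \eta_t (y - x_0) + \kappa \sqrt{\eta_t}\,\epsilon$, which shifts the target $x_0$ toward a conditioning image $y$ instead of toward pure noise. It is not the paper's multiple-input formulation. The function names, the geometric schedule, and its end points are assumptions made for illustration only.

```python
import numpy as np

def shift_schedule(num_steps, eta_min=0.001, eta_max=0.999):
    """Increasing shifting sequence eta_1..eta_T.

    Assumption: a plain geometric spacing, chosen for simplicity;
    the schedule actually used by ResShift or by the paper may differ.
    """
    return np.geomspace(eta_min, eta_max, num_steps)

def forward_sample(x0, y, eta_t, kappa, rng):
    """Draw x_t ~ N(x0 + eta_t * (y - x0), kappa^2 * eta_t * I).

    x0    : target frame (ground truth during training)
    y     : conditioning image the chain drifts toward (placeholder here)
    eta_t : shift amount in [0, 1] at step t
    kappa : scalar controlling the noise strength
    """
    residual = y - x0
    noise = rng.standard_normal(x0.shape)
    return x0 + eta_t * residual + kappa * np.sqrt(eta_t) * noise

# Usage with random stand-in images and a 10-step schedule.
rng = np.random.default_rng(0)
x0 = rng.random((4, 4, 3))
y = rng.random((4, 4, 3))
etas = shift_schedule(10)
x_mid = forward_sample(x0, y, etas[5], kappa=2.0, rng=rng)
```

Because the chain starts near $y$ rather than from pure Gaussian noise, the gap to be closed is only the residual $y - x_0$, which is the usual argument for why a schedule of roughly 10 steps can suffice.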
Abstract
In this work, we propose a new diffusion-based method for video frame interpolation (VFI) in the context of traditional hand-drawn animation. We make three main contributions. First, we explicitly handle the interpolation time in our model and re-estimate it during training, to cope with the particularly large variations observed in the animation domain compared to natural videos. Second, we adapt and generalize to VFI a diffusion scheme called ResShift, recently proposed in the super-resolution community, which allows us to produce our estimates with a very low number of diffusion steps (on the order of 10). Third, we leverage the stochastic nature of the diffusion process to provide a pixel-wise estimate of the uncertainty on the interpolated frame, which could be useful for anticipating where the model may be wrong. We provide extensive comparisons against state-of-the-art models and show that our model outperforms them on animation videos. Our code is available at https://github.com/VicFonch/Multi-Input-Resshift-Diffusion-VFI.
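One common way to turn a stochastic sampler into a pixel-wise uncertainty map is to draw several interpolated frames for the same inputs and measure how much they disagree at each pixel. The sketch below does this with the per-pixel standard deviation. The abstract does not specify the estimator, so this choice is an assumption, and `toy_sampler` is a hypothetical stand-in for the diffusion model that simply adds noise to the left half of a constant image.

```python
import numpy as np

def pixelwise_uncertainty(sample_fn, num_samples, rng):
    """Estimate a mean frame and an H x W uncertainty map.

    sample_fn(rng) must return one stochastic H x W x C frame.
    Uncertainty is the across-sample standard deviation, averaged
    over channels (an assumed choice, not the paper's stated one).
    """
    samples = np.stack([sample_fn(rng) for _ in range(num_samples)], axis=0)
    mean_frame = samples.mean(axis=0)   # H x W x C
    std_frame = samples.std(axis=0)     # H x W x C
    return mean_frame, std_frame.mean(axis=-1)

def toy_sampler(rng):
    """Hypothetical sampler: only the left half of the frame is noisy."""
    frame = np.full((8, 8, 3), 0.5)
    frame[:, :4] += 0.1 * rng.standard_normal((8, 4, 3))
    return frame

mean_frame, uncertainty = pixelwise_uncertainty(
    toy_sampler, num_samples=16, rng=np.random.default_rng(0)
)
```

In this toy case the map is nonzero only in the left half, where the samples differ. With a real sampler, high-valued regions would be the ones to inspect first, at the cost of running the sampler `num_samples` times.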
