Motion-aware Latent Diffusion Models for Video Frame Interpolation
Zhilin Huang, Yijie Yu, Ling Yang, Chujun Qin, Bing Zheng, Xiawu Zheng, Zikun Zhou, Yaowei Wang, Wenming Yang
TL;DR
This work tackles the motion ambiguity challenge in video frame interpolation by introducing MADiff, a motion-aware latent diffusion framework. MADiff integrates explicit inter-frame motion priors into diffusion-based generation via a novel VQ-MAGAN module and a motion-aware sampling strategy (MA-Sampling), enabling progressive refinement of interpolated frames. Empirical results across multiple benchmarks, including dynamic-texture and 4K content, demonstrate state-of-the-art perceptual quality (LPIPS, FloLPIPS, FID) and competitive fidelity metrics relative to both non-diffusion and diffusion-based baselines. The approach offers a flexible, modular pathway for incorporating diverse motion hints (e.g., events, flow) and highlights a trade-off between computational cost and perceptual gains, suggesting acceleration as a direction for future work.
Abstract
With the advancement of AI-generated content (AIGC), video frame interpolation (VFI) has become a crucial component of existing video generation frameworks, attracting widespread research interest. For the VFI task, motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity. However, existing VFI methods often struggle to accurately predict the motion between consecutive frames, and this imprecise estimation leads to blurred and visually incoherent interpolated frames. In this paper, we propose a novel diffusion framework, motion-aware latent diffusion models (MADiff), specifically designed for the VFI task. By incorporating motion priors between the conditioning neighboring frames and the target interpolated frame, as predicted throughout the diffusion sampling procedure, MADiff progressively refines its intermediate results and ultimately generates visually smooth and realistic frames. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing approaches, especially in challenging scenarios involving dynamic textures with complex motion.
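The sampling idea summarized above (re-estimating motion hints from the current prediction of the target frame at every step, then refining that prediction) can be illustrated with a toy loop. This is only a minimal sketch under strong assumptions, not the MADiff implementation: `toy_motion_hint`, `toy_denoiser`, and `motion_aware_sampling` are hypothetical names, the absolute frame difference stands in for a real motion extractor, a hand-written weighted blend stands in for a learned denoising network, and the noise schedule is a simple linear placeholder.

```python
import numpy as np


def toy_motion_hint(frame_a, frame_b):
    """Hypothetical stand-in for a motion extractor (e.g. optical flow).

    Here it is just the per-pixel absolute difference between two frames.
    """
    return np.abs(frame_a - frame_b)


def toy_denoiser(x_t, prev_frame, next_frame, hint_prev, hint_next):
    """Hypothetical stand-in for a learned, motion-conditioned denoiser.

    Returns an estimate of the clean target frame. A real network would
    also use the noisy input x_t; this toy ignores it and blends the two
    neighbors, trusting each one more where its motion hint is smaller.
    """
    w_prev = 1.0 / (1.0 + hint_prev)
    w_next = 1.0 / (1.0 + hint_next)
    return (w_prev * prev_frame + w_next * next_frame) / (w_prev + w_next)


def motion_aware_sampling(prev_frame, next_frame, num_steps=10, seed=0):
    """Toy sampling loop: motion hints are recomputed at every step from
    the current estimate of the target frame (assumed structure)."""
    rng = np.random.default_rng(seed)
    x_t = rng.standard_normal(prev_frame.shape)   # start from pure noise
    x0_est = 0.5 * (prev_frame + next_frame)      # initial target guess

    for step in range(num_steps, 0, -1):
        # Motion hints between each neighbor and the current prediction.
        hint_prev = toy_motion_hint(prev_frame, x0_est)
        hint_next = toy_motion_hint(next_frame, x0_est)

        # Refine the prediction, conditioned on neighbors and hints.
        x0_est = toy_denoiser(x_t, prev_frame, next_frame, hint_prev, hint_next)

        # Placeholder linear schedule: move x_t toward the prediction,
        # reaching it exactly at the last step (noise_level == 0).
        noise_level = (step - 1) / num_steps
        x_t = (1.0 - noise_level) * x0_est + noise_level * x_t

    return x_t


if __name__ == "__main__":
    prev_frame = np.zeros((4, 4))
    next_frame = np.ones((4, 4))
    middle = motion_aware_sampling(prev_frame, next_frame)
    print(middle.shape)
```

The point of the sketch is the loop structure only: the conditioning signal is not fixed before sampling starts but is recomputed from the intermediate prediction, which is what allows the estimate to be refined progressively.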
