
Motion-aware Latent Diffusion Models for Video Frame Interpolation

Zhilin Huang, Yijie Yu, Ling Yang, Chujun Qin, Bing Zheng, Xiawu Zheng, Zikun Zhou, Yaowei Wang, Wenming Yang

TL;DR

This work tackles the motion ambiguity challenge in video frame interpolation by introducing MADiff, a motion-aware latent diffusion framework. MADiff integrates explicit inter-frame motion priors into diffusion-based generation via a novel VQ-MAGAN module and a motion-aware MA-Sampling strategy, enabling progressive refinement of interpolated frames. Empirical results across multiple benchmarks, including dynamic-texture and 4K content, demonstrate state-of-the-art perceptual quality (LPIPS, FloLPIPS, FID) and competitive fidelity metrics, outperforming both non-diffusion and diffusion-based baselines. The approach offers a flexible, modular pathway to incorporate diverse motion hints (e.g., events, flow) and highlights a trade-off between computational cost and perceptual gains, suggesting directions for acceleration in future work.

Abstract

With the advancement of AI-generated content (AIGC), video frame interpolation (VFI) has become a crucial component of existing video generation frameworks and has attracted widespread research interest. In VFI, motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity. However, existing VFI methods struggle to accurately predict the motion between consecutive frames, and this imprecise estimation leads to blurred and visually incoherent interpolated frames. In this paper, we propose a novel diffusion framework, motion-aware latent diffusion models (MADiff), designed specifically for the VFI task. MADiff incorporates motion priors between the conditioning neighboring frames and the target interpolated frame, which is predicted throughout the diffusion sampling procedure. In this way it progressively refines the intermediate outcomes and ultimately generates results that are both visually smooth and realistic. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing approaches, especially in challenging scenarios involving dynamic textures with complex motion.
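
The progressive refinement described in the abstract can be illustrated with a small, purely hypothetical sketch. It is not the authors' implementation: `motion_hint` is a plain frame difference standing in for the learned motion extractor, `toy_denoiser` is a hand-written noise predictor rather than a trained network, the noise schedule is arbitrary, and the loop runs directly on arrays instead of in the latent space of an encoder and decoder. It only shows the control flow assumed here: at each deterministic (DDIM-style) step, the current estimate of the target frame is used to refresh the motion hints against both neighbors.

```python
import numpy as np

def motion_hint(src, dst):
    # Hypothetical stand-in for a motion extractor: a plain frame difference.
    return dst - src

def toy_denoiser(z_t, alpha_bar_t, frame_prev, m_prev, m_next):
    # Hypothetical noise predictor (not a trained model). It guesses the
    # target by moving frame_prev halfway along the summed motion hints,
    # then returns the noise implied by that guess.
    x_guess = frame_prev + 0.5 * (m_prev + m_next)
    return (z_t - np.sqrt(alpha_bar_t) * x_guess) / np.sqrt(1.0 - alpha_bar_t)

def ma_sampling_sketch(frame_prev, frame_next, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    # Assumed schedule: index 0 is the least noisy level.
    alpha_bars = np.linspace(0.99, 0.01, steps)
    z = rng.standard_normal(frame_prev.shape)  # start from pure noise
    # Before any prediction exists, split the neighbor-to-neighbor motion evenly.
    m_prev = m_next = 0.5 * motion_hint(frame_prev, frame_next)
    for i in reversed(range(steps)):
        ab = alpha_bars[i]
        eps = toy_denoiser(z, ab, frame_prev, m_prev, m_next)
        # Current estimate of the clean target frame.
        x0_hat = (z - np.sqrt(1.0 - ab) * eps) / np.sqrt(ab)
        # Refresh the motion hints using the predicted target frame.
        m_prev = motion_hint(frame_prev, x0_hat)
        m_next = motion_hint(x0_hat, frame_next)
        # Deterministic step to the next, less noisy level.
        ab_prev = alpha_bars[i - 1] if i > 0 else 1.0
        z = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps
    return z

if __name__ == "__main__":
    a, b = np.zeros((4, 4)), np.ones((4, 4))
    print(ma_sampling_sketch(a, b).round(3))
```

With difference-based hints the toy predictor always lands on the midpoint of the two neighbors, so the sketch demonstrates only where the hint refresh sits in the loop, not the benefit of learned motion priors.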

Paper Structure (39 sections, 11 equations, 4 figures, 8 tables, 4 algorithms)


Figures (4)

  • Figure 1: Overview of the diffusion processes in MADiff. The encoder and decoder project between the image and latent spaces, and the diffusion processes take place in the latent space. $I2E$ denotes the image-to-event generator [zhu2021eventgan], which generates an event volume from two consecutive frames, and $m$ denotes the motion hints extracted by $I2E$.
  • Figure 2: The architecture of the vector-quantized motion-aware generative adversarial network, VQ-MAGAN. In practice, the motion extractor is the image-to-event generator [zhu2021eventgan]; $m_{i \rightarrow j}$ denotes the inter-frame motion hints between frames $i$ and $j$.
  • Figure 3: Visual examples of frames interpolated by the state-of-the-art methods and the proposed MADiff. Under large and complex motions, our method preserves the most high-frequency details, delivering superior perceptual quality.
  • Figure 4: More visual examples of frames interpolated by the state-of-the-art methods and the proposed MADiff. Under large and complex motions, our method preserves the most high-frequency details, delivering superior perceptual quality.