EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation

Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, Zuxuan Wu

TL;DR

EDEN tackles large-motion video frame interpolation by enhancing diffusion-based VFI through a transformer-based latent tokenizer and a diffusion transformer with temporal attention and start-end frame difference conditioning. It introduces a Pyramid Feature Fusion Module and multi-resolution/multi-frame interval fine-tuning to handle motion and resolution variability, and employs dual-stream context integration to better incorporate start and end frame information. The approach achieves state-of-the-art perceptual metrics on DAVIS, DAIN-HD, and SNU-FILM benchmarks, while maintaining efficiency with a minimal number of denoising steps. These advances demonstrate the potential of diffusion-based VFI to handle complex, real-world motion with improved temporal coherence and visual quality.

Abstract

Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.
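The two conditioning mechanisms named in the abstract, temporal attention and a start-end frame difference embedding, can be illustrated with a toy NumPy sketch. This is not the authors' implementation. The function names, the per-position attention over three frames, the mean-pooled difference vector, and the additive injection are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(mid, start, end, w_q, w_k, w_v):
    """Each intermediate-frame token attends over the tokens at the same
    spatial position in the start, intermediate, and end frames (assumed layout)."""
    seq = np.stack([start, mid, end], axis=1)        # (tokens, 3, dim)
    q = mid @ w_q                                    # (tokens, dim)
    k = seq @ w_k                                    # (tokens, 3, dim)
    v = seq @ w_v                                    # (tokens, 3, dim)
    scores = np.einsum("nd,ntd->nt", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)               # (tokens, 3), rows sum to 1
    return np.einsum("nt,ntd->nd", weights, v), weights

def difference_embedding(start, end, w_diff):
    """Pool the token-wise start-end difference into one conditioning vector.
    Mean pooling and tanh are hypothetical choices."""
    return np.tanh((end - start).mean(axis=0) @ w_diff)

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
start, mid, end = (rng.standard_normal((n_tokens, dim)) for _ in range(3))
w_q, w_k, w_v, w_diff = (0.1 * rng.standard_normal((dim, dim)) for _ in range(4))

attended, weights = temporal_attention(mid, start, end, w_q, w_k, w_v)
cond = difference_embedding(start, end, w_diff)
out = mid + attended + cond   # residual update; cond is broadcast over all tokens
```

In this sketch, identical start and end frames give a zero difference vector, so the conditioning signal grows only when the two context frames actually differ. That is one plausible reading of how such an embedding could guide the generation of dynamic motion.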

Paper Structure

This paper contains 27 sections, 13 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Comparisons between existing diffusion-based methods and our proposed EDEN on large-motion scenarios.
  • Figure 2: Results of existing diffusion-based VFI methods when decoding from random noise versus denoised latents. The quality of the generated frames shows no significant difference between the two.
  • Figure 3: Illustration of intermediate frame reconstruction (a) and generation (b) with the transformer block (c). First, we train a tokenizer by reconstructing the intermediate frames to obtain latent tokens with strong representational capabilities. Then, we train a diffusion transformer on these latent tokens. During inference, the diffusion transformer generates latent tokens from noise, and the tokenizer decoder decodes them into the intermediate frame. Start and end frame information is injected into the transformer block through temporal attention and the difference embedding.
  • Figure 4: Pyramid Feature Fusion Module design of encoder (a) and decoder (b).
  • Figure 5: Visual comparison of different methods, with examples selected from DAVIS, DAIN-HD544p and SNU-FILM. Our method outperforms previous methods both in capturing the motion of multiple objects and in modeling fast, nonlinear motion.
  • ...and 4 more figures
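
The inference path described in the Figure 3 caption (sample noise, denoise into latent tokens, decode with the tokenizer decoder) can be sketched as a toy loop. Everything here is a hypothetical stand-in: `toy_velocity` replaces the trained diffusion transformer, `toy_decode` replaces the tokenizer decoder, and the Euler-style sampler and the default of two steps are assumptions chosen for illustration, since the text above does not specify the sampler.

```python
import numpy as np

def toy_velocity(latent, t, start_tokens, end_tokens):
    # Hypothetical stand-in for the diffusion transformer: it pulls the
    # latent toward the midpoint of the start and end frame tokens.
    target = 0.5 * (start_tokens + end_tokens)
    return (target - latent) / (1.0 - t)

def toy_decode(latent, w_dec):
    # Hypothetical stand-in for the tokenizer decoder: a linear map
    # from latent tokens to flattened pixel patches.
    return latent @ w_dec

def interpolate(start_tokens, end_tokens, w_dec, num_steps=2, seed=0):
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(start_tokens.shape)   # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt                                     # t runs over [0, 1)
        latent = latent + dt * toy_velocity(latent, t, start_tokens, end_tokens)
    return toy_decode(latent, w_dec)

rng = np.random.default_rng(1)
n_tokens, dim, patch = 16, 8, 12
start_tokens = rng.standard_normal((n_tokens, dim))
end_tokens = rng.standard_normal((n_tokens, dim))
w_dec = rng.standard_normal((dim, patch))

frame = interpolate(start_tokens, end_tokens, w_dec)
```

With this toy velocity field the final Euler step lands exactly on the target latent for any number of steps, so the loop only shows the control flow of few-step sampling. A trained model would predict its update from the noisy latent and the context frames rather than from a known target.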