Hierarchical Flow Diffusion for Efficient Frame Interpolation

Yang Hai, Guo Wang, Tan Su, Wenjie Jiang, Yinlin Hu

TL;DR

This work addresses the accuracy and efficiency gap of diffusion-based video frame interpolation by introducing a hierarchical flow diffusion framework that explicitly denoises optical flow in a coarse-to-fine, multiscale manner. A flow-guided image synthesizer, trained with pseudo bilateral flow from a pretrained model, generates the intermediate frame, while a jointly trained hierarchical diffusion model refines the flow conditioned on encoder features. The approach yields state-of-the-art interpolation quality and over 10x faster inference than prior diffusion-based methods, with competitive memory usage, making high-resolution interpolation practical. The combination of explicit flow modeling, multiscale conditioning, and end-to-end fine-tuning offers a scalable and effective solution for handling large motions and complex scenes.

Abstract

Most recent diffusion-based methods still show a large gap compared to non-diffusion methods for video frame interpolation, in both accuracy and efficiency. Most of them formulate the problem as a denoising procedure directly in latent space, which is less effective because the latent space is large. We propose to model bilateral optical flow explicitly with hierarchical diffusion models, which gives a much smaller search space in the denoising procedure. Based on the flow diffusion model, we then use a flow-guided image synthesizer to produce the final result. We train the flow diffusion model and the image synthesizer end to end. Our method achieves state-of-the-art accuracy and is 10+ times faster than other diffusion-based methods. The project page is at: https://hfd-interpolation.github.io.
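
The coarse-to-fine idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the `refine` callback is a hypothetical stand-in for the per-level flow denoiser, the nearest-neighbour upsampling is a simplification, and the function names are chosen for illustration only. The sketch shows only how a flow estimate is carried from a coarse level to the next finer one, with displacements rescaled to the new resolution.

```python
import numpy as np

def upsample_flow(flow, factor=2):
    """Nearest-neighbour upsample an (H, W, 2) flow field by `factor`.

    Displacements are measured in pixels, so they are multiplied by
    `factor` when the resolution grows.
    """
    up = flow.repeat(factor, axis=0).repeat(factor, axis=1)
    return up * factor

def coarse_to_fine(shape, levels, refine):
    """Build a full-resolution flow field level by level.

    shape  : (H, W) of the finest level; H and W divisible by 2**(levels-1)
    levels : number of pyramid levels (level 0 is the finest)
    refine : hypothetical callback (flow, level) -> residual flow with the
             same shape as `flow`; stands in for one per-level denoiser
    """
    h, w = shape
    scale = 2 ** (levels - 1)
    flow = np.zeros((h // scale, w // scale, 2))   # start at the coarsest level
    for level in range(levels - 1, -1, -1):
        flow = flow + refine(flow, level)          # correct the current estimate
        if level > 0:
            flow = upsample_flow(flow)             # hand it to the next finer level
    return flow

# Toy usage: a "denoiser" that always adds one pixel of motion per level.
toy = coarse_to_fine((8, 8), levels=3, refine=lambda f, lvl: np.ones_like(f))
```

In this toy setting the coarsest level has only 2 x 2 x 2 unknowns, which gives a rough sense of why large motions are cheaper to resolve at coarse levels before being refined at finer ones.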

Paper Structure

This paper contains 10 sections, 10 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Different methods for video frame interpolation. Most diffusion-based [ho2020ddpm, song2021ddim] interpolation methods (LDMVFI [danier2024ldmvfi], CBBD [lyu2024cbbd]) still have a large gap from non-diffusion-based methods (SGM-VFI [liu2024sgm-vfi]), in both accuracy and efficiency. We propose a diffusion-based model that is 10+ times faster than other diffusion-based methods, and on par with SGM-VFI in efficiency. More importantly, we achieve significantly better accuracy than all baselines. Note how the details and large motions are missed in the baselines, but recovered with our method. We report inference time in seconds on the same RTX-4090 GPU for a typical 1024$\times$1024 image pair.
  • Figure 2: Different strategies with diffusion models for video frame interpolation. Given an image pair ($I_0$, $I_1$), our goal is to predict the intermediate frame $\tilde{I}_t$. (a) Most diffusion-based methods [danier2024ldmvfi, lyu2024cbbd, huang2024madiff] formulate the problem as a denoising process directly in the latent space ($\tilde{F}_t$), and train the diffusion network and the encoder-decoder ("E" and "D") network separately. This strategy is less effective because the latent space is large. Moreover, it cannot handle complex motions and large displacements. (b) We use a hierarchical strategy with explicit flow modeling. We first train a flow-based encoder-decoder as the image synthesizer with image pairs and the ground truth optical flow. Then, unlike most diffusion-based methods that denoise the latent space directly, we use a hierarchical diffusion model, conditioned on the encoder features ($F_0$, $F_1$), to explicitly denoise optical flow from coarse to fine. We use the predicted bilateral flow ($\tilde{f}_0$, $\tilde{f}_1$) to warp image features for the synthesizer, and finally fine-tune the synthesizer and the diffusion models jointly.
  • Figure 3: Overview of our method. We first construct a flow-guided encoder-decoder with multiscale features as our image synthesizer, and then use diffusion to explicitly denoise optical flow in a coarse-to-fine manner, where the diffusion at each level is conditioned on encoder features from the corresponding level. We use the predicted intermediate optical flow to warp encoder features at each level, and use a multiscale decoder to synthesize the final target image.
  • Figure 4: Illustration of flow-guided image synthesis. We train a multiscale encoder-decoder as our image synthesizer based on image pairs ($I_0$, $I_1$) and bilateral optical flow ($\tilde{f}_0$, $\tilde{f}_1$).
  • Figure 5: Results of the hierarchical models on different scales. We show the coarse-to-fine results from left to right in addition to the input and ground truth. With the proposed hierarchical diffusion models, the result becomes progressively better with finer resolution, making it capable of handling complex motions and large displacements.
  • ...and 4 more figures
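
The captions for Figures 2 to 4 describe warping encoder features with the predicted bilateral flow. A minimal NumPy sketch of backward warping with bilinear sampling follows. It is an illustration under stated assumptions, not the paper's code: the function name `backward_warp`, the (dx, dy) pixel-offset convention, and the border clamping are all choices made here for the example.

```python
import numpy as np

def backward_warp(feat, flow):
    """Sample `feat` at positions displaced by `flow` (bilinear, border-clamped).

    feat : (H, W, C) feature map or image
    flow : (H, W, 2) per-pixel offsets (dx, dy) in pixels (assumed convention)
    Returns an (H, W, C) array where out[y, x] = feat sampled at (x + dx, y + dy).
    """
    h, w = feat.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)

    # Target sampling coordinates, clamped to the valid image area.
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer corners surrounding each sampling position.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets, broadcast over the channel axis.
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]

    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

# Toy usage: shift a small ramp image half a pixel to the right.
ramp = np.arange(12.0).reshape(3, 4, 1)
half = np.zeros((3, 4, 2))
half[..., 0] = 0.5
warped = backward_warp(ramp, half)
```

In a bilateral setting, one such warp would be applied to the features of $I_0$ with $\tilde{f}_0$ and another to the features of $I_1$ with $\tilde{f}_1$ at each pyramid level, before the decoder fuses them. How the paper's synthesizer actually combines the warped features is not specified in this section.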