High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion

Junhwa Hur, Charles Herrmann, Saurabh Saxena, Janne Kontkanen, Wei-Sheng Lai, Yichang Shih, Michael Rubinstein, David J. Fleet, Deqing Sun

TL;DR

This work introduces HiFI, a patch-based cascaded pixel diffusion framework for high-resolution frame interpolation that operates at resolutions up to 8K with low memory overhead. By running diffusion on patches with a single weight-shared model across cascade levels, HiFI handles challenging cases such as large motion and repetitive textures, and performs both base-frame estimation and upsampling within one architecture. The method achieves comparable or state-of-the-art performance on high-resolution benchmarks (Xiph, X-TEST, SEPE-8K) and significantly outperforms other baselines on LaMoR, a new dataset focused on difficult scenarios. Distillation and multi-frame extensions are noted as opportunities for further efficiency gains.

Abstract

Despite recent progress, existing frame interpolation methods still struggle with extremely high-resolution inputs and with challenging cases such as repetitive textures, thin objects, and large motion. To address these issues, we introduce HiFI, a patch-based cascaded pixel diffusion model for high-resolution frame interpolation that excels in these scenarios while achieving competitive performance on standard benchmarks. Cascades, which generate a series of images from low to high resolution, help significantly with large or complex motion that requires both global context for a coarse solution and detailed context for high-resolution output. However, unlike prior cascaded diffusion models, which perform diffusion at increasingly large resolutions, we use a single model that always performs diffusion at the same resolution and upsamples by processing patches of the inputs and the prior solution. At inference time, this drastically reduces memory usage and allows a single model to solve both frame interpolation (the base model's task) and spatial upsampling, which also saves training cost. HiFI excels at high-resolution images and complex repeated textures that require global context, achieving comparable or state-of-the-art performance on various benchmarks (Vimeo, Xiph, X-Test, and SEPE-8K). We further introduce a new dataset, LaMoR, that focuses on particularly challenging cases, on which HiFI significantly outperforms other baselines. Please visit our project page for video results: https://hifi-diffusion.github.io
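To make the fixed-resolution idea concrete, the following minimal sketch counts how many diffusion calls each cascade level would need if every call sees the same window size. The window size (`base=256`), the 2x factor between levels, and the non-overlapping tiling are illustrative assumptions, not values taken from the paper; `cascade_schedule` is a hypothetical helper name.

```python
import math


def cascade_schedule(height, width, base=256, scale=2):
    """Return a coarse-to-fine list of (level_height, level_width, n_patches).

    Assumptions (not from the paper): each diffusion call covers a
    base x base window, levels differ by `scale`, and patches do not overlap.
    """
    levels = []
    h, w = height, width
    # Walk down from full resolution until the frame fits in one window.
    while h > base or w > base:
        n_patches = math.ceil(h / base) * math.ceil(w / base)
        levels.append((h, w, n_patches))
        h, w = math.ceil(h / scale), math.ceil(w / scale)
    # Base level: the whole downsampled frame is handled in a single call.
    levels.append((h, w, 1))
    return levels[::-1]


if __name__ == "__main__":
    for level in cascade_schedule(4320, 7680):  # an 8K frame
        print(level)
```

Under these assumptions, the number of calls grows with resolution while the size of each call stays constant, which is why peak memory per call does not depend on the output resolution.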

Paper Structure

This paper contains 40 sections, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Qualitative comparison on challenging cases on our proposed LaMoR dataset (rows 1 and 2) and X-TEST (row 3). For challenging cases, such as large motion or repetitive textures, the proposed HiFI substantially outperforms other baselines.
  • Figure 2: Our base model is conditioned on two input frames, $\mathbf{I}_0$ and $\mathbf{I}_2$, and predicts the intermediate frame $\mathbf{I}_1$. The model uses $\bm{v}$-parameterization [Salimans:2022:PDF, Saxena:2023:ZSM] for both model output and loss.
  • Figure 3: Patch-based cascade model. Given a low-resolution intermediate frame from the previous level, the patch-based cascade creates patches from the bilinearly upsampled low-resolution intermediate frame and the two input frames, and uses these patches as conditioning for a diffusion process. It then combines the denoised patches to form the whole image. At inference time, a single weight-shared model is applied recursively across image scales, as in Figure 4. A two-stage cascade is shown for simplicity.
  • Figure 4: Upsampling strategy. Like a standard cascade, we process the image from coarse to fine, but we always denoise at the same resolution, as indicated by the red box. Details on each step of the cascade are in Figure 3.
  • Figure 5: Qualitative examples on public datasets. Our method performs well even in cases of large motion and complex textures, such as the thin object (top) and the plate number (bottom).
  • ...and 8 more figures
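
The cascade step described in the Figure 3 caption (bilinear upsampling, patch extraction, per-patch denoising, recombination) can be sketched roughly as follows. This is a toy single-channel illustration in plain Python, not the authors' implementation: `denoise_patch` is a hypothetical callback standing in for the weight-shared diffusion model, and the 2x factor, non-overlapping patches, and border handling are assumptions.

```python
def upsample_bilinear_2x(img):
    """Bilinearly upsample a 2-D list of floats by a factor of 2."""
    h, w = len(img), len(img[0])

    def taps(i, size):
        # Half-pixel-centre mapping from output index to source coordinate,
        # clamped so that border pixels are replicated.
        src = min(max((i + 0.5) / 2 - 0.5, 0.0), size - 1.0)
        lo = int(src)
        return lo, min(lo + 1, size - 1), src - lo

    out = []
    for y in range(2 * h):
        y0, y1, ty = taps(y, h)
        row = []
        for x in range(2 * w):
            x0, x1, tx = taps(x, w)
            top = (1 - tx) * img[y0][x0] + tx * img[y0][x1]
            bot = (1 - tx) * img[y1][x0] + tx * img[y1][x1]
            row.append((1 - ty) * top + ty * bot)
        out.append(row)
    return out


def crop(img, top, left, size):
    """Extract a size x size patch (smaller at the image border)."""
    return [row[left:left + size] for row in img[top:top + size]]


def cascade_step(low_res, frame0, frame2, denoise_patch, patch=2):
    """One cascade level: upsample the previous estimate, denoise each
    patch conditioned on it and on both input frames, then stitch."""
    coarse = upsample_bilinear_2x(low_res)
    h, w = len(coarse), len(coarse[0])
    out = [[0.0] * w for _ in range(h)]
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Hypothetical model call: same function at every level.
            pred = denoise_patch(
                crop(coarse, top, left, patch),
                crop(frame0, top, left, patch),
                crop(frame2, top, left, patch),
            )
            # Write the denoised patch back into the full-size output.
            for dy, row in enumerate(pred):
                out[top + dy][left:left + len(row)] = row
    return out
```

Calling `cascade_step` repeatedly, feeding each output back in as `low_res` with the input frames resized to the matching level, would give a multi-level cascade in which the same `denoise_patch` is reused at every scale. A practical system would likely overlap and blend patches to avoid seams; that is omitted here.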