High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion
Junhwa Hur, Charles Herrmann, Saurabh Saxena, Janne Kontkanen, Wei-Sheng Lai, Yichang Shih, Michael Rubinstein, David J. Fleet, Deqing Sun
TL;DR
This work introduces HiFI, a patch-based cascaded pixel diffusion framework for high-resolution frame interpolation that operates at resolutions up to 8K with low memory overhead. By performing diffusion on patches with a single model whose weights are shared across cascade levels, HiFI achieves strong fidelity in challenging cases such as large motion and repetitive textures, and it handles both base-frame estimation and upsampling within one architecture. The method reports comparable or state-of-the-art performance on high-resolution benchmarks (Xiph, X-Test, SEPE-8K) and outperforms prior diffusion and non-diffusion baselines on LaMoR, a new dataset focused on difficult scenarios. The results point to practical value for high-quality, high-resolution video frame interpolation, with room for further efficiency gains via distillation and for multi-frame extensions.
Abstract
Despite recent progress, existing frame interpolation methods still struggle to process extremely high-resolution inputs and to handle challenging cases such as repetitive textures, thin objects, and large motion. To address these issues, we introduce HiFI, a patch-based cascaded pixel diffusion model for high-resolution frame interpolation that excels in these scenarios while achieving competitive performance on standard benchmarks. Cascades, which generate a series of images from low to high resolution, help significantly with large or complex motion, which requires both global context for a coarse solution and detailed context for high-resolution output. However, in contrast to prior work on cascaded diffusion models, which performs diffusion at increasingly large resolutions, we use a single model that always performs diffusion at the same resolution and upsamples by processing patches of the inputs and of the prior solution. At inference time, this drastically reduces memory usage. It also lets a single model solve both frame interpolation (the base model's task) and spatial upsampling, which saves training cost. HiFI excels at high-resolution images and complex repeated textures that require global context, achieving comparable or state-of-the-art performance on various benchmarks (Vimeo, Xiph, X-Test, and SEPE-8K). We further introduce a new dataset, LaMoR, that focuses on particularly challenging cases, on which HiFI significantly outperforms the baselines. Please visit our project page for video results: https://hifi-diffusion.github.io
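The control flow of a patch-based cascade, as described above, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The names `toy_model`, `cascade_interpolate`, `downsample2`, and `upsample2` are hypothetical, the diffusion sampler is replaced by a trivial blend, patches do not overlap, the image is assumed to be square with a side of `patch` times a power of two, and the base level uses the average of the two inputs as a placeholder for the missing prior solution. The point is only that the model is called at a fixed `patch x patch` resolution at every level.

```python
import numpy as np

def downsample2(img):
    """Halve resolution with 2x2 average pooling (H and W must be even)."""
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample2(img):
    """Double resolution with nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def toy_model(p0, p1, prior):
    """Hypothetical stand-in for the shared diffusion sampler.

    A real model would denoise a patch conditioned on the two input
    patches and the upsampled prior; here we only blend them.
    """
    return 0.5 * (0.5 * (p0 + p1)) + 0.5 * prior

def cascade_interpolate(f0, f1, model=toy_model, patch=64):
    """Estimate the middle frame coarse-to-fine with one fixed-size model."""
    assert f0.shape == f1.shape and f0.shape[0] == f0.shape[1]
    # Input pyramids, finest level first, until one patch covers the image.
    pyr0, pyr1 = [f0], [f1]
    while pyr0[-1].shape[0] > patch:
        pyr0.append(downsample2(pyr0[-1]))
        pyr1.append(downsample2(pyr1[-1]))
    assert pyr0[-1].shape[0] == patch, "side must be patch * 2**k"

    # Base level: the whole (downsampled) image fits in a single patch.
    # Assumption: no earlier estimate exists, so use the input average.
    c0, c1 = pyr0[-1], pyr1[-1]
    est = model(c0, c1, 0.5 * (c0 + c1))

    # Finer levels: upsample the previous estimate and refine per patch.
    for a, b in zip(reversed(pyr0[:-1]), reversed(pyr1[:-1])):
        prior = upsample2(est)
        est = np.empty_like(a)
        for y in range(0, a.shape[0], patch):
            for x in range(0, a.shape[1], patch):
                s = (slice(y, y + patch), slice(x, x + patch))
                est[s] = model(a[s], b[s], prior[s])
    return est

if __name__ == "__main__":
    f0 = np.full((256, 256, 3), 0.2)
    f1 = np.full((256, 256, 3), 0.6)
    mid = cascade_interpolate(f0, f1)
    print(mid.shape)
```

Because every call to `model` receives `patch x patch` arrays, the working set of the model itself does not grow with the output resolution in this sketch; only the stitched estimate does. Inputs are assumed to be floating-point arrays.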
