FC-VFI: Faithful and Consistent Video Frame Interpolation for High-FPS Slow Motion Video Generation

Ganggui Ding, Hao Chen, Xiaogang Xu

TL;DR

FC-VFI is a method for faithful and consistent video frame interpolation. It supports \(4\times\) and \(8\times\) interpolation, raising frame rates from 30 FPS to 120 and 240 FPS at \(2560\times 1440\) resolution while preserving visual fidelity and motion consistency.

Abstract

Large pre-trained video diffusion models excel at video frame interpolation but struggle to generate high-fidelity frames because they rely on intrinsic generative priors, which limits how much detail is preserved from the start and end frames. Existing methods often depend on motion control for temporal consistency, yet dense optical flow is error-prone and sparse points lack structural context. In this paper, we propose FC-VFI for faithful and consistent video frame interpolation, supporting \(4\times\) and \(8\times\) interpolation and boosting frame rates from 30 FPS to 120 and 240 FPS at \(2560\times 1440\) resolution while preserving visual fidelity and motion consistency. We introduce a temporal modeling strategy on the latent sequences to inherit fidelity cues from the start and end frames, and we leverage semantic matching lines for structure-aware motion guidance, improving motion consistency. Furthermore, we propose a temporal difference loss to mitigate temporal inconsistencies. Extensive experiments show that FC-VFI achieves high performance and structural integrity across diverse scenarios.
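The frame-rate figures in the abstract follow from simple arithmetic: a \(k\times\) interpolation multiplies the source frame rate by \(k\) and places \(k-1\) new frames between each pair of input frames. The sketch below illustrates this for the 30 FPS case. The function name `interpolation_plan` and the evenly spaced timestamps are assumptions made for illustration and are not taken from the paper.

```python
from fractions import Fraction


def interpolation_plan(src_fps, factor):
    """Return (output_fps, new_frames_per_gap, timestamps) for k-times interpolation.

    Hypothetical helper for illustration only. Timestamps are normalized so
    that the start frame sits at t=0 and the end frame at t=1, and uniform
    spacing is assumed.
    """
    if factor < 2:
        raise ValueError("factor must be at least 2")
    out_fps = src_fps * factor      # e.g. 30 FPS * 4 = 120 FPS
    new_per_gap = factor - 1        # frames inserted between two input frames
    timestamps = [Fraction(i, factor) for i in range(1, factor)]
    return out_fps, new_per_gap, timestamps


for factor in (4, 8):
    fps, n_new, ts = interpolation_plan(30, factor)
    print(f"{factor}x: {fps} FPS, {n_new} new frames per gap, "
          f"t = {[str(t) for t in ts]}")
```

With a 30 FPS source, the \(4\times\) setting yields 120 FPS with 3 new frames per gap, and the \(8\times\) setting yields 240 FPS with 7 new frames per gap.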

Paper Structure (17 sections, 7 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: Overview of FC-VFI's training pipeline. (a) We model temporal references by concatenating the noisy latent $\mathbf{z}_t^n$ with the start and end image latents $\mathbf{z}_s$ and $\mathbf{z}_e$ along the temporal dimension, enabling the denoising process to reference both boundaries. (b) We apply fidelity modulation by applying a timestep-dependent modulation $t^*$ to $\mathbf{z}_s$ and $\mathbf{z}_e$, which enhances reference stability. (c) Semantic matching line features $\mathbf{c}_s$ and $\mathbf{c}_e$ are extracted and encoded from the start frame $\mathbf{I}_s$ and end frame $\mathbf{I}_e$, then added element-wise to $\mathbf{z}_s$ and $\mathbf{z}_e$, resulting in enhanced latents $\mathbf{z}_s'$ and $\mathbf{z}_e'$. These are processed by a copied DiT block to produce $\mathbf{z}_{\text{res}}^n$, which is injected back into the main backbone. (d) The prediction $\hat{\mathbf{v}}_t^n$ is supervised with a temporal difference loss $\mathcal{L}_{\text{temp}}$.
  • Figure 2: Qualitative comparison of interpolation results. (Top) Comparison with GIMM-VFI guo2024generalizable on DAVIS-2017 pont20172017 at $2560 \times 1440$ resolution under $8\times$ interpolation. Our method better handles challenging conditions such as high-contrast lighting, small objects, and occlusion, avoiding artifacts like ghosting and structural distortion. (Bottom) Comparison with diffusion-based methods (GI wang2024generative, ViBiDSampler yang2024vibidsampler, FCVG zhu2025generative) on X-Test sim2021xvfi and DAVIS-2017 at $1024 \times 576$ resolution under $8\times$ interpolation. FC-VFI preserves finer details (e.g., text, license plates, building textures), while other methods suffer from motion ambiguity and temporal artifacts.
  • Figure 3: Ablation results of our method visualized under different module configurations. The displayed intermediate frame is closer to the end frame.
  • Figure 4: Additional qualitative comparison of interpolation results. (Top) Visual comparisons with GIMM-VFI guo2024generalizable at $2560 \times 1440$ resolution under $8\times$ interpolation. Tested on additional challenging scenes, our FC-VFI effectively suppresses structural distortion and ghosting artifacts. (Bottom) Visual comparisons with recent diffusion-based methods (GI wang2024generative, ViBiDSampler yang2024vibidsampler, and FCVG zhu2025generative) at $1024 \times 576$ resolution under $8\times$ interpolation. FC-VFI consistently demonstrates superior capability in recovering fine details (e.g., complex boundaries and textual patterns) and maintaining temporal consistency compared to the baselines (Sec. \ref{sec:appendix_qualitative}).
  • Figure 5: Comparison of Time Reversal, Channel Reference, and Temporal Reference paradigms for diffusion-based video frame interpolation (Sec. \ref{sec_difference_with_other_methods}).
  • ...and 1 more figure
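Two steps from the Figure 1 caption can be sketched in a few lines: concatenating the boundary latents with the noisy latents along the temporal dimension (a), and supervising the prediction with a temporal difference loss (d). The sketch below uses plain Python lists in place of tensors, with each frame reduced to a short list of floats. The ordering `[z_s, noisy..., z_e]`, the L1 form of the loss, and all function names are assumptions for illustration; this section does not give the exact formulation of $\mathcal{L}_{\text{temp}}$.

```python
def concat_temporal(z_start, z_noisy, z_end):
    """Place the start and end latents around the noisy latents on the time axis.

    Assumed ordering: [z_start, noisy frames..., z_end].
    """
    return [z_start] + list(z_noisy) + [z_end]


def temporal_diff(frames):
    """Element-wise differences between consecutive frames."""
    return [[b - a for a, b in zip(f0, f1)]
            for f0, f1 in zip(frames, frames[1:])]


def temporal_difference_loss(pred, target):
    """Mean absolute error between the temporal differences of pred and target.

    Assumed L1 form; the paper's actual loss may differ.
    """
    d_pred, d_tgt = temporal_diff(pred), temporal_diff(target)
    total = sum(abs(x - y)
                for fp, ft in zip(d_pred, d_tgt)
                for x, y in zip(fp, ft))
    count = sum(len(fp) for fp in d_pred)
    return total / count


# Toy latents: each frame is a list of two values.
sequence = concat_temporal([0.0, 0.0], [[0.5, 0.5], [0.7, 0.7]], [1.0, 1.0])
print(len(sequence))  # 4 frames on the time axis

target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # steady motion
jumpy = [[0.0, 0.0], [2.0, 2.0], [2.0, 2.0]]   # jump, then stall
print(temporal_difference_loss(target, target))  # 0.0
print(temporal_difference_loss(jumpy, target))   # 1.0
```

A loss of this kind penalizes mismatched frame-to-frame change rather than per-frame error alone, which is why the jumpy sequence is penalized even though its first and last frames match the target.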