AceVFI: A Comprehensive Survey of Advances in Video Frame Interpolation

Dahyeon Kye, Changhyun Roh, Sukhun Ko, Chanho Eom, Jihyong Oh

Abstract

Video Frame Interpolation (VFI) is a core low-level vision task that synthesizes intermediate frames between existing ones while ensuring spatial and temporal coherence. Over the past decades, VFI methodologies have evolved from classical motion compensation-based approaches to a wide spectrum of deep learning-based approaches, including kernel-, flow-, hybrid-, phase-, GAN-, Transformer-, Mamba-, and, most recently, diffusion-based models. We introduce AceVFI, a comprehensive and up-to-date review of the VFI field, covering over 250 representative papers. We systematically categorize VFI methods based on their core design principles and architectural characteristics. Further, we classify them into two major learning paradigms: Center-Time Frame Interpolation (CTFI) and Arbitrary-Time Frame Interpolation (ATFI). We analyze key challenges in VFI, including large motion, occlusion, lighting variation, and non-linear motion. In addition, we review standard datasets, loss functions, and evaluation metrics. We also explore VFI applications in other domains and highlight future research directions. This survey aims to serve as a valuable reference for researchers and practitioners seeking a thorough understanding of the modern VFI landscape.

Paper Structure

This paper contains 51 sections, 14 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: General process of VFI. Given $M$ consecutive input frames $\{I_j\}_{j=0}^{M-1}$, the VFI model $\mathcal{F}$ synthesizes one or more intermediate frames, producing an output sequence $\{\hat{I}_{t_k}\}_{k=0}^{N-1}$ with $t_k \in (\lfloor \frac{M}{2} \rfloor - 1, \lfloor \frac{M}{2} \rfloor)$, where $M \ge 2$ and $N \ge 1$.
  • Figure 2: Overview of the survey structure.
  • Figure 3: Temporal alignment strategies. Input frames $(I_0, I_1)$ or features $(F_0, F_1)$ are aligned toward target time $t$ using four strategies: (a) kernel-based, (b) flow-based, (c) attention-based, and (d) cost volume-based. The right side details three cases. In (a') (kernel-based), the blue square denotes the fixed kernel support window centered at the output location in $I_t$, while the gray patches indicate the actual sampling positions in $I_{t-1}$ and $I_{t+1}$ gathered via learned offsets, so motion is encoded implicitly through the offset pattern and kernel weights. In (b') (flow-based), light blue marks the reference location reached by a displacement vector, so the motion field is explicit. (a'+b') shows the combined design, in which an explicit flow first transports the support window toward a reference region and a local kernel is then applied around that flow-guided position for refinement.
  • Figure 4: General pipeline of VFI. Given two input frames $I_0$ and $I_1$, deep features $F_0$ and $F_1$ are first extracted. The features or pixels are then temporally aligned to the target time $t$ using estimated motion, producing $\hat{F}_{0 \rightarrow t}$, $\hat{F}_{1 \rightarrow t}$ or $\hat{I}_{0 \rightarrow t}$, $\hat{I}_{1 \rightarrow t}$. A Frame Synthesis module blends the aligned inputs to produce the final frame $\hat{I}_t$.
  • Figure 5: Comparison of different convolution types. (a) Standard convolution samples at a fixed grid location $(x+k, y+l)$. (b) Deformable convolution introduces learnable offsets $(\alpha_{k,l}, \beta_{k,l})$, enabling adaptive sampling at $(x+k+\alpha_{k,l},\ y+l+\beta_{k,l})$. (c) Dynamic convolution further generalizes this by predicting the kernel weights $W_i^{k,l}(x, y)$ dynamically for each output position, allowing for spatially-variant filtering.
  • ...and 6 more figures
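
The three sampling schemes described in the Figure 5 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not code from the survey: the names `bilinear`, `conv_at`, and `toy_predictor`, the 5×5 ramp image, and the box kernel are all hypothetical, and bilinear interpolation is assumed for fractional sampling positions since the caption does not specify one.

```python
import math


def bilinear(img, x, y):
    """Sample img (a list of rows) at real-valued (x, y); pixels outside count as 0."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def px(xi, yi):
        return img[yi][xi] if 0 <= xi < w and 0 <= yi < h else 0.0

    top = (1 - fx) * px(x0, y0) + fx * px(x0 + 1, y0)
    bot = (1 - fx) * px(x0, y0 + 1) + fx * px(x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bot


def conv_at(img, x, y, weights, offsets=None):
    """Filter response at output position (x, y).

    weights[(k, l)] : weight of kernel tap (k, l)
    offsets[(k, l)] : optional (alpha, beta) shift of that tap's sampling position
    Each tap samples the input at (x + k + alpha, y + l + beta).
    """
    out = 0.0
    for (k, l), w in weights.items():
        a, b = offsets.get((k, l), (0.0, 0.0)) if offsets else (0.0, 0.0)
        out += w * bilinear(img, x + k + a, y + l + b)
    return out


def toy_predictor(x, y):
    """Hypothetical stand-in for a network that predicts per-position kernel weights."""
    return {(1, 0): 1.0} if x % 2 == 0 else {(-1, 0): 1.0}


# Horizontal ramp: the pixel value equals its x coordinate.
img = [[float(x) for x in range(5)] for _ in range(5)]

# 3x3 box kernel shared by every output position.
taps = [(k, l) for k in (-1, 0, 1) for l in (-1, 0, 1)]
box = {tap: 1.0 / 9.0 for tap in taps}

# (a) Standard convolution: fixed grid positions (x + k, y + l).
standard = conv_at(img, 2, 2, box)

# (b) Deformable convolution: same weights, each tap shifted by an offset.
#     Here every tap is moved half a pixel to the right.
half_px = {tap: (0.5, 0.0) for tap in taps}
deformable = conv_at(img, 2, 2, box, half_px)

# (c) Dynamic convolution: the weights themselves depend on the output position.
dynamic_even = conv_at(img, 2, 2, toy_predictor(2, 2))
dynamic_odd = conv_at(img, 1, 2, toy_predictor(1, 2))

print(standard, deformable, dynamic_even, dynamic_odd)
```

On the ramp image, the standard box filter at (2, 2) averages to approximately 2.0, and shifting every tap half a pixel to the right moves the result to approximately 2.5. The toy dynamic kernel reads the right-hand neighbor at even x and the left-hand neighbor at odd x, which is the spatially-variant behavior the caption describes. In a real model the offsets and weights would be predicted by learned layers rather than set by hand.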