Real-Time Intermediate Flow Estimation for Video Frame Interpolation

Zhewei Huang, Tianyuan Zhang, Wen Heng, Boxin Shi, Shuchang Zhou

TL;DR

This work tackles real-time video frame interpolation by introducing RIFE, which uses IFNet to directly estimate intermediate optical flows and a fusion map, and synthesizes target frames without relying on pre-trained flow models. A privileged distillation strategy gives a teacher access to ground-truth intermediate frames during training, which stabilizes training and boosts accuracy, while the coarse-to-fine IFNet design enables efficient end-to-end learning. Temporal encoding allows interpolation at arbitrary timesteps, which broadens the applications beyond fixed-timestep outputs, and a lightweight RefineNet stage refines the result. Empirical results on the Vimeo90K, HD, and X4K-1000FPS benchmarks show state-of-the-art performance with substantial speedups over prior flow-based methods, highlighting RIFE’s practical potential for real-time VFI on devices and for more flexible temporal processing tasks.

Abstract

Real-time video frame interpolation (VFI) is very useful in video processing, media players, and display devices. We propose RIFE, a Real-time Intermediate Flow Estimation algorithm for VFI. To realize a high-quality flow-based VFI method, RIFE uses a neural network named IFNet that estimates the intermediate flows end-to-end at much higher speed. A privileged distillation scheme is designed to stabilize IFNet training and improve the overall performance. RIFE does not rely on pre-trained optical flow models and supports arbitrary-timestep frame interpolation through the temporal encoding input. Experiments demonstrate that RIFE achieves state-of-the-art performance on several public benchmarks. Compared with the popular SuperSlomo and DAIN methods, RIFE is 4 to 27 times faster and produces better results. Furthermore, RIFE can be extended to wider applications thanks to temporal encoding. The code is available at https://github.com/megvii-research/ECCV2022-RIFE.
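
To make the temporal encoding input concrete, here is a minimal NumPy sketch of one way to build it. The Figure 2 caption describes the timestep as encoded in a separate channel, so the sketch stacks the two frames with a constant plane holding $t$. The function name `build_ifnet_input`, the channel-first layout, and the resulting 7-channel ordering are assumptions made for illustration, not details taken from the released code.

```python
import numpy as np


def build_ifnet_input(frame0, frame1, t):
    """Stack two RGB frames and a constant timestep plane along the channel axis.

    frame0, frame1: float arrays of shape (3, H, W) (assumed channel-first layout).
    t: target timestep in [0, 1].
    Returns an array of shape (7, H, W): 3 + 3 image channels plus 1 timestep channel.
    """
    if frame0.shape != frame1.shape:
        raise ValueError("frames must have the same shape")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    _, height, width = frame0.shape
    # Every pixel of the extra channel carries the same scalar timestep.
    t_plane = np.full((1, height, width), t, dtype=frame0.dtype)
    return np.concatenate([frame0, frame1, t_plane], axis=0)


if __name__ == "__main__":
    i0 = np.zeros((3, 4, 5))
    i1 = np.ones((3, 4, 5))
    x = build_ifnet_input(i0, i1, t=0.25)
    print(x.shape)
```

Because the timestep is an ordinary input channel, requesting a different intermediate frame only changes the value of `t`; the frames themselves stay the same.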

Paper Structure

This paper contains 20 sections, 8 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Performance comparison. Results are reported for the Vimeo90K [xue2019video] and HD-$4\times$ [bao2019depth] benchmarks. More details are in the experimental section
  • Figure 2: Overview of the RIFE pipeline. Given two input frames $I_0, I_1$ and temporal encoding $t$ (the timestep encoded as a separate channel [mnih2013playing, huang2019learning]), we directly feed them into the IFNet to approximate the intermediate flows $F_{t\rightarrow 0}, F_{t\rightarrow 1}$ and the fusion map $M$. During the training phase, a privileged teacher refines the student's results based on the ground truth $I_t$ using a special IFBlock
  • Figure 3: Comparison of indirect intermediate flow estimation [jiang2018super, xu2019quadratic, bao2019depth] (left) with IFNet (right). As the object shifts, flow reversal modules may have flaws at motion boundaries. Rather than hand-engineering flow reversal layers, CNNs can learn intermediate flow estimates end-to-end
  • Figure 4: Left: The IFNet is composed of several stacked IFBlocks operating at different resolutions. Right: In an IFBlock, we first backward warp the two input frames based on the current approximated flow $F^{i-1}$. Then the input frames $I_0, I_1$, the warped frames $\widehat{I}_{t\leftarrow 0}, \widehat{I}_{t\leftarrow 1}$, the previous results $F^{i-1}, M^{i-1}$, and the timestep $t$ are fed into the next IFBlock to approximate the residual of the flow and mask. The privileged information $I_t^{GT}$ is provided only to the teacher
  • Figure 5: Results of DVF [liu2017video] (Vimeo90K). After feeding the edge map of intermediate frames (privileged information) into the model, the estimated flows can be significantly improved, resulting in better reconstruction on the validation set
  • ...and 8 more figures
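
The captions for Figures 2 and 4 mention backward warping the two input frames with the intermediate flows and combining them with the fusion map $M$. The sketch below illustrates that step in plain NumPy. It assumes a convex blend $M \odot \widehat{I}_{t\leftarrow 0} + (1 - M) \odot \widehat{I}_{t\leftarrow 1}$, bilinear sampling with border clamping, and a flow layout of `(dx, dy)` in pixels; none of these details are stated in the captions, and the function names `backward_warp` and `fuse` are hypothetical.

```python
import numpy as np


def backward_warp(image, flow):
    """Sample `image` at positions displaced by `flow` (bilinear, border-clamped).

    image: (C, H, W) array.
    flow:  (2, H, W) array; flow[0] is the x displacement, flow[1] the y
           displacement, both in pixels (assumed convention).
    """
    _, height, width = image.shape
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    # Source coordinates for every target pixel, clamped to the image border.
    src_x = np.clip(xs + flow[0], 0, width - 1)
    src_y = np.clip(ys + flow[1], 0, height - 1)
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    y1 = np.minimum(y0 + 1, height - 1)
    wx = src_x - x0
    wy = src_y - y0
    # Bilinear interpolation between the four neighbouring source pixels.
    top = image[:, y0, x0] * (1 - wx) + image[:, y0, x1] * wx
    bottom = image[:, y1, x0] * (1 - wx) + image[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def fuse(frame0, frame1, flow_t0, flow_t1, mask):
    """Blend the two warped frames with a fusion map in [0, 1] of shape (1, H, W)."""
    warped0 = backward_warp(frame0, flow_t0)
    warped1 = backward_warp(frame1, flow_t1)
    return mask * warped0 + (1 - mask) * warped1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    i0 = rng.random((3, 8, 8))
    i1 = rng.random((3, 8, 8))
    zero_flow = np.zeros((2, 8, 8))
    half_mask = np.full((1, 8, 8), 0.5)
    out = fuse(i0, i1, zero_flow, zero_flow, half_mask)
    print(out.shape)
```

With zero flow and a mask of 0.5 everywhere, the output reduces to the plain average of the two frames, which is a quick sanity check for the blend. In a trained model the flows and mask would come from the network rather than being set by hand.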