LADDER: An Efficient Framework for Video Frame Interpolation

Tong Shen; Dong Li; Ziheng Gao; Lu Tian; Emad Barsoum

TL;DR

This work tackles video frame interpolation (VFI) with a focus on balancing efficiency and quality. It introduces LADDER, a framework that combines a flow estimator using large-kernel depth-wise convolutions with a decoder-only refinement module and an HD-aware augmentation strategy to improve performance on HD frames. Experiments on Vimeo90K, UCF101, Xiph, and SNU-FILM show state-of-the-art results while substantially reducing FLOPs and parameter counts, thanks to careful architectural choices and training procedures. Ablation studies validate the contributions of large-kernel flow estimation, decoder-only refinement, HD-aware augmentation, and a two-stage training regime, suggesting this approach as a strong, practical baseline for efficient VFI.

Abstract

Video Frame Interpolation (VFI) is a crucial technique in applications such as slow-motion generation, frame rate conversion, and video frame restoration. This paper introduces an efficient video frame interpolation framework that aims to strike a favorable balance between efficiency and quality. Our framework follows the general paradigm of a flow estimator followed by a refinement module, while incorporating carefully designed components. First, we adopt depth-wise convolution with large kernels in the flow estimator, which simultaneously reduces the number of parameters and enlarges the receptive field for encoding rich context and handling complex motion. Second, diverging from the common UNet-style (encoder-decoder) design for the refinement module, which we find redundant, our decoder-only refinement module directly enhances the result from coarse to fine features, offering a more efficient process. In addition, to address the challenge of handling high-definition frames, we introduce an HD-aware augmentation strategy during training, leading to consistent improvement on HD images. Extensive experiments are conducted on diverse datasets: Vimeo90K, UCF101, Xiph, and SNU-FILM. The results demonstrate that our approach achieves state-of-the-art performance with clear improvement while requiring far fewer FLOPs and parameters, reaching a better balance between efficiency and quality.
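The parameter saving claimed for large-kernel depth-wise convolution can be checked with simple arithmetic. The sketch below is illustrative only: the channel width (64), the kernel sizes (3 and 7), and the pairing of the depth-wise layer with a 1x1 point-wise convolution are assumptions chosen for the example, not values taken from the paper, and the helper names are hypothetical.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Parameter count of a standard (dense) k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def depthwise_pointwise_params(c_in, c_out, k, bias=True):
    """Parameter count of a k x k depth-wise conv followed by a 1x1 point-wise conv.

    The point-wise layer is an assumption of this sketch; it is the usual
    way to mix channels after a depth-wise layer.
    """
    depthwise = c_in * k * k + (c_in if bias else 0)   # one k x k filter per channel
    pointwise = c_in * c_out + (c_out if bias else 0)  # 1x1 channel mixing
    return depthwise + pointwise


def stacked_receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, undilated convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


channels = 64  # illustrative width, not from the paper
dense_3x3 = conv2d_params(channels, channels, 3)
dw_7x7 = depthwise_pointwise_params(channels, channels, 7)

print("dense 3x3 params:", dense_3x3)
print("depth-wise 7x7 + point-wise params:", dw_7x7)
print("receptive field, one 7x7 layer:", stacked_receptive_field([7]))
print("receptive field, three 3x3 layers:", stacked_receptive_field([3, 3, 3]))
```

Under these assumed sizes, a single 7x7 depth-wise layer covers the same receptive field as three stacked 3x3 layers while using fewer parameters than one dense 3x3 layer, which matches the trade-off the abstract describes.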

Paper Structure (9 sections, 11 equations, 3 figures, 6 tables)

This paper contains 9 sections, 11 equations, 3 figures, 6 tables.

Introduction
Related Work
Method
Training Objectives
Experiments
Implementation Details
Comparisons with Other Methods
Ablation Study
Conclusion

Figures (3)

Figure 1: Model comparison on the Vimeo90K dataset in terms of FLOPs, PSNR, and number of parameters. We roughly divide the models into light-weight and large models, shown in green and orange respectively. The marker size indicates the number of parameters, ranging from 1.4M to 60+M. We present two versions of our method, colored red. In both the light-weight and large groups, we achieve higher PSNR with much lower complexity, reaching a better balance.
Figure 2: Illustration of our pipeline. Different processes are represented by colored lines. The images are first fed into the feature extractor $\mathcal{F}$ to produce five levels of features, where the first three levels are convolution-based features and the last two are attention-based features. These features are further processed by the flow estimator, which consists of a low-res decoder $\mathcal{G}_{low}$ and three high-res decoders $\mathcal{G}_{high}^l$. This process, colored green, generates the motion flows and composition map, $W^0$. The high-res features, the input images, and the estimated motion are further processed by the refinement module to generate the residual term $\mathbf{R}$, colored orange. The final prediction, indicated by black lines, is the composition of the two input images plus the residual term.
Figure 3: Qualitative comparison.
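
The last sentence of the Figure 2 caption can be written as a single equation. This is a sketch of the composition step under assumed notation, not the paper's own formula: only the residual $\mathbf{R}$ is named in the caption, while the predicted frame $\hat{I}_t$, the input frames $I_0$ and $I_1$, the flows $F_{t \to 0}$ and $F_{t \to 1}$, the backward-warping operator $\mathcal{W}$, and the composition mask $M$ are hypothetical symbols introduced here for illustration.

```latex
% Assumed notation: only \mathbf{R} appears in the source caption.
\hat{I}_t = M \odot \mathcal{W}(I_0, F_{t \to 0})
          + (1 - M) \odot \mathcal{W}(I_1, F_{t \to 1})
          + \mathbf{R}
```

Here $\odot$ denotes element-wise multiplication, so the mask blends the two warped inputs per pixel and the refinement module only has to predict a correction on top of that blend.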
