LADDER: An Efficient Framework for Video Frame Interpolation
Tong Shen, Dong Li, Ziheng Gao, Lu Tian, Emad Barsoum
TL;DR
This work tackles video frame interpolation (VFI) with a focus on balancing efficiency and quality. It introduces LADDER, a framework that combines a flow estimator built on large-kernel depth-wise convolutions, a decoder-only refinement module, and an HD-aware augmentation strategy that improves performance on HD frames. Experiments on Vimeo90K, UCF101, Xiph, and SNU-FILM show state-of-the-art results with substantially fewer FLOPs and parameters. Ablation studies validate the contributions of large-kernel flow estimation, decoder-only refinement, HD-aware augmentation, and a two-stage training regime, positioning the approach as a strong, practical baseline for efficient VFI.
Abstract
Video Frame Interpolation (VFI) is a crucial technique in applications such as slow-motion generation, frame rate conversion, and video frame restoration. This paper introduces an efficient video frame interpolation framework that aims to strike a favorable balance between efficiency and quality. Our framework follows the general paradigm of a flow estimator followed by a refinement module, while incorporating carefully designed components. First, we adopt depth-wise convolution with large kernels in the flow estimator, which simultaneously reduces the parameter count and enlarges the receptive field for encoding rich context and handling complex motion. Second, diverging from the common UNet-style (encoder-decoder) design for the refinement module, which we find redundant, our decoder-only refinement module directly enhances the result from coarse to fine features, offering a more efficient process. In addition, to address the challenge of handling high-definition frames, we introduce an HD-aware augmentation strategy during training, leading to consistent improvement on HD images. Extensive experiments are conducted on diverse datasets: Vimeo90K, UCF101, Xiph, and SNU-FILM. The results demonstrate that our approach achieves state-of-the-art performance with clear improvement while requiring far fewer FLOPs and parameters, achieving a better balance between efficiency and quality.
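The claim that large-kernel depth-wise convolution can reduce parameters while enlarging the receptive field can be checked with simple arithmetic. The sketch below is only an illustration, not the LADDER architecture: the channel width of 64, the 7x7 and 3x3 kernel sizes, and the pairing of the depth-wise layer with a 1x1 point-wise layer are all assumptions chosen for the example, and the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k, groups=1, bias=True):
    """Parameter count of a 2D convolution with a k x k kernel."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel sees only c_in / groups input channels.
    weights = (c_in // groups) * c_out * k * k
    return weights + (c_out if bias else 0)


def depthwise_separable_params(channels, k, bias=True):
    """Depth-wise k x k conv followed by a 1x1 point-wise conv.

    The point-wise layer is an assumption of this sketch; it is the
    usual way to mix channels after a depth-wise layer.
    """
    depthwise = conv_params(channels, channels, k, groups=channels, bias=bias)
    pointwise = conv_params(channels, channels, 1, bias=bias)
    return depthwise + pointwise


channels = 64  # hypothetical width, not taken from the paper

standard_3x3 = conv_params(channels, channels, 3)
large_dw_7x7 = depthwise_separable_params(channels, 7)

print("standard 3x3 conv parameters:      ", standard_3x3)
print("depth-wise 7x7 + 1x1 conv parameters:", large_dw_7x7)
print("single-layer receptive field: 3x3 vs 7x7")
```

Under these assumed sizes, the depth-wise 7x7 layer plus point-wise mixing uses roughly a fifth of the parameters of a standard 3x3 convolution while covering a 7x7 window in a single layer, which is the trade-off the abstract describes.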
