Unified Arbitrary-Time Video Frame Interpolation and Prediction
Xin Jin, Longhai Wu, Jie Chen, Ilhyun Cho, Cheul-Hee Hahm
TL;DR
This work addresses the fragmentation between video frame interpolation and prediction by introducing uniVIP, a unified model capable of arbitrary-time frame synthesis for both tasks. Building on the forward-warping paradigm of UPR-Net, uniVIP adds a task-aware channel and trains on mixed triplets to learn a shared, flexible representation for interpolation and prediction across non-integer time steps. The approach demonstrates competitive interpolation accuracy and state-of-the-art performance for next-frame and arbitrary-time prediction, while maintaining a compact, unified architecture with efficient runtime. The results suggest practical benefits for deployment in scenarios requiring both tasks, and the ablations validate the effectiveness of the fusion strategy, single-direction training, and task signaling.
Abstract
Video frame interpolation and prediction aim to synthesize frames in between and subsequent to existing frames, respectively. Despite being closely related, these two tasks are traditionally studied with different model architectures, or with the same architecture but individually trained weights. Furthermore, while arbitrary-time interpolation has been extensively studied, the value of arbitrary-time prediction has been largely overlooked. In this work, we present uniVIP - unified arbitrary-time Video Interpolation and Prediction. Technically, we first extend an interpolation-only network for arbitrary-time interpolation and prediction, with a special input channel for task (interpolation or prediction) encoding. Then, we show how to train a unified model on common triplet frames. Our uniVIP provides competitive results for video interpolation, and outperforms existing state-of-the-art methods for video prediction. Code will be available at: https://github.com/srcn-ivl/uniVIP
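To make the idea of task encoding and triplet-based training concrete, the following is a minimal sketch of how one triplet could yield either an interpolation or a prediction sample. It is not the released uniVIP code. The function names (make_sample, sample_mixed), the task-plane values (0 for interpolation, 1 for prediction), and the time convention (input frames at times 0 and 1, so the middle frame is t = 0.5 and the next frame is t = 2) are all assumptions chosen for illustration.

```python
import random
import numpy as np

def make_sample(triplet, task):
    """Build one training sample from a triplet of shape (3, H, W, C).

    Hypothetical convention: the two input frames sit at times 0 and 1,
    so interpolation targets t = 0.5 and prediction targets t = 2.0.
    """
    f0, f1, f2 = triplet
    if task == "interpolation":
        # Inputs are the outer frames; the middle frame is the target.
        first, second, target, t, task_value = f0, f2, f1, 0.5, 0.0
    elif task == "prediction":
        # Inputs are the first two frames; the last frame is the target.
        first, second, target, t, task_value = f0, f1, f2, 2.0, 1.0
    else:
        raise ValueError(f"unknown task: {task}")

    h, w = f0.shape[:2]
    # Constant plane that signals the task (assumed encoding, not the paper's).
    task_plane = np.full((h, w, 1), task_value, dtype=f0.dtype)
    # Stack both input frames and the task plane along the channel axis.
    x = np.concatenate([first, second, task_plane], axis=-1)
    return x, target, t

def sample_mixed(triplet, rng):
    """Pick a task at random so one model sees both kinds of samples."""
    task = rng.choice(["interpolation", "prediction"])
    return make_sample(triplet, task)

triplet = np.random.rand(3, 4, 5, 3).astype(np.float32)
x, target, t = sample_mixed(triplet, random.Random(0))
print(x.shape, t)
```

With RGB frames, the stacked input has 3 + 3 + 1 = 7 channels. In this sketch the target time is returned separately; how the actual network consumes the time step and the task channel is not specified in the abstract.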
