Unified Arbitrary-Time Video Frame Interpolation and Prediction

Xin Jin, Longhai Wu, Jie Chen, Ilhyun Cho, Cheul-Hee Hahm

TL;DR

This work addresses the fragmentation between video frame interpolation and prediction by introducing uniVIP, a unified model capable of arbitrary-time frame synthesis for both tasks. Building on the forward-warping paradigm of UPR-Net, uniVIP adds a task-aware channel and trains on mixed triplets to learn a shared, flexible representation for interpolation and prediction across non-integer time steps. The approach demonstrates competitive interpolation accuracy and state-of-the-art performance for next-frame and arbitrary-time prediction, while maintaining a compact, unified architecture with efficient runtime. The results suggest practical benefits for deployment in scenarios requiring both tasks, and the ablations validate the effectiveness of the fusion strategy, single-direction training, and task signaling.

Abstract

Video frame interpolation and prediction aim to synthesize frames in between and subsequent to existing frames, respectively. Despite being closely related, these two tasks are traditionally studied with different model architectures, or with the same architecture but individually trained weights. Furthermore, while arbitrary-time interpolation has been extensively studied, the value of arbitrary-time prediction has been largely overlooked. In this work, we present uniVIP: unified arbitrary-time Video Interpolation and Prediction. Technically, we first extend an interpolation-only network to arbitrary-time interpolation and prediction, with a special input channel that encodes the task (interpolation or prediction). Then, we show how to train a unified model on common triplet frames. Our uniVIP provides competitive results for video interpolation, and outperforms existing state-of-the-art methods for video prediction. Code will be available at: https://github.com/srcn-ivl/uniVIP
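
The idea of a task-encoding input channel can be illustrated with a minimal sketch. The abstract only states that a special input channel encodes the task, so everything concrete below is an assumption made for illustration: the hypothetical helper names `task_of` and `build_condition`, the 0/1 encoding of the task flag, the extra constant time channel, and the channel order. The input frames are assumed to sit at $t=0$ and $t=1$, so that $0 < t < 1$ is interpolation and any $t$ outside that interval (e.g., $t$ = 1.4 or -0.6) is prediction.

```python
import numpy as np


def task_of(t: float) -> str:
    """Classify a target time, assuming the input frames sit at t=0 and t=1."""
    if 0.0 < t < 1.0:
        return "interpolation"
    if t < 0.0 or t > 1.0:
        return "prediction"
    raise ValueError("t=0 and t=1 are input frames; there is nothing to synthesize")


def build_condition(frame0: np.ndarray, frame1: np.ndarray, t: float) -> np.ndarray:
    """Stack two H x W x 3 frames with a constant time channel and task channel.

    The encoding (0.0 = interpolation, 1.0 = prediction) is an illustrative
    assumption, not the encoding used by uniVIP.
    """
    if frame0.shape != frame1.shape:
        raise ValueError("both frames must have the same shape")
    h, w, _ = frame0.shape
    task_flag = 0.0 if task_of(t) == "interpolation" else 1.0
    time_ch = np.full((h, w, 1), t, dtype=np.float32)
    task_ch = np.full((h, w, 1), task_flag, dtype=np.float32)
    return np.concatenate(
        [frame0.astype(np.float32), frame1.astype(np.float32), time_ch, task_ch],
        axis=-1,
    )


if __name__ == "__main__":
    f0 = np.zeros((4, 6, 3), dtype=np.float32)
    f1 = np.ones((4, 6, 3), dtype=np.float32)
    for t in (0.5, 1.4, -0.6):
        x = build_condition(f0, f1, t)
        print(t, task_of(t), x.shape)
```

In this sketch the network would receive an 8-channel tensor per pixel (two RGB frames, one time channel, one task channel), so a single set of weights can be conditioned on both the target time and the task.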

Paper Structure

This paper contains 33 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Conceptual comparison between our uniVIP model and existing video frame interpolation and prediction models. We pursue a unified model for both tasks, enabling frame synthesis at arbitrary times (e.g., $t$ = 1.4 or -0.6).
  • Figure 2: Overview of uniVIP at each pyramid level (within a pyramid recurrent framework as in UPR-Net [jin2023unified]). In our uniVIP, $t$ can take an arbitrary value, and a special task channel differentiates the interpolation and prediction tasks.
  • Figure 3: For prediction (e.g., $t=2$), artifacts in the warped frames appear at similar positions, whereas for interpolation (e.g., $t=0.5$), they appear at quite different positions.
  • Figure 4: Left: examples from SNU-FILM and X-TEST for interpolation. Right: examples from SNU-FILM and X-TEST for prediction.