Disambiguation for Video Frame Interpolation

Zhihang Zhong, Yiming Zhang, Wei Wang, Xiao Sun, Yu Qiao, Gurunandan Krishnan, Sizhuo Ma, Jian Wang

TL;DR

This work addresses velocity ambiguity in video frame interpolation by reframing the problem with distance indexing, which provides a deterministic motion hint during training, and by introducing iterative reference-based estimation to decompose long-range predictions into shorter steps. It further leverages nearby frames through a continuous distance map estimator and a multi-frame refiner to improve interpolation quality at arbitrary timesteps, while enabling per-object manipulation via segmentation tools like SAM. The approach is plug-and-play with existing VFI models and yields superior perceptual quality (LPIPS, NIQE) and competitive pixel-centric metrics, particularly when multi-frame information is utilized. Collectively, these contributions offer practical tools for high-quality, controllable video interpolation and editing, with potential extensions to more challenging motion regimes using generative priors.

Abstract

Existing video frame interpolation (VFI) methods blindly predict where each object is at a specific timestep t ("time indexing") and therefore struggle to predict precise object movements. Given two images of a baseball, there are infinitely many possible trajectories: accelerating or decelerating, straight or curved. This often results in blurry frames as the method averages out these possibilities. Instead of forcing the network to learn this complicated time-to-location mapping implicitly, we provide the network with an explicit hint on how far the object has traveled between start and end frames, a novel approach termed "distance indexing". This method offers a clearer learning goal for models, reducing the uncertainty tied to object speeds. However, even with this extra guidance, objects can still be blurry, especially when they are equally far from both input frames, due to the directional ambiguity in long-range motion. To solve this, we propose an iterative reference-based estimation strategy that breaks down a long-range prediction into several short-range steps. When integrating our plug-and-play strategies into state-of-the-art learning-based models, they exhibit markedly superior perceptual quality in arbitrary time interpolations, using a uniform distance indexing map in the same format as time indexing without requiring extra computation. Furthermore, we demonstrate that if additional latency is acceptable, a continuous map estimator can be employed to compute a pixel-wise dense distance indexing using multiple nearby frames. Combined with efficient multi-frame refinement, this extension can further disambiguate complex motion, thus enhancing performance both qualitatively and quantitatively. Additionally, the ability to manually specify distance indexing allows for independent temporal manipulation of each object, providing a novel tool for video editing tasks such as re-timing.
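
The iterative reference-based strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the five-argument model signature, the evenly spaced schedule, and the choice of the start frame as the first reference are all assumptions made for illustration, and `toy_blend_model` is a hypothetical stand-in for a trained network.

```python
from typing import Callable, List

Frame = List[List[float]]  # toy grayscale frame; real models operate on tensors
# Hypothetical model signature: (I_0, I_1, D_t, I_ref, D_ref) -> I_t
Model = Callable[[Frame, Frame, Frame, Frame, Frame], Frame]


def uniform_distance_map(height: int, width: int, d: float) -> Frame:
    """Distance map holding the same value d at every pixel."""
    return [[d] * width for _ in range(height)]


def reference_schedule(target: float, num_steps: int) -> List[float]:
    """Split one long-range target into evenly spaced short-range targets."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [target * k / num_steps for k in range(1, num_steps + 1)]


def interpolate_iteratively(model: Model, i0: Frame, i1: Frame,
                            target: float, num_steps: int) -> Frame:
    """Reach `target` in short steps, feeding each output back as the reference."""
    height, width = len(i0), len(i0[0])
    # Assumption: the first step uses the start frame (distance 0) as reference.
    ref_frame, ref_map = i0, uniform_distance_map(height, width, 0.0)
    for d in reference_schedule(target, num_steps):
        d_map = uniform_distance_map(height, width, d)
        ref_frame = model(i0, i1, d_map, ref_frame, ref_map)
        ref_map = d_map
    return ref_frame


def toy_blend_model(i0: Frame, i1: Frame, d_map: Frame,
                    ref_frame: Frame, ref_map: Frame) -> Frame:
    """Stand-in for a trained network: a per-pixel linear blend that ignores the reference."""
    return [[(1.0 - d) * a + d * b for a, b, d in zip(r0, r1, rd)]
            for r0, r1, rd in zip(i0, i1, d_map)]


if __name__ == "__main__":
    start, end = [[0.0, 0.0]], [[1.0, 1.0]]
    print(interpolate_iteratively(toy_blend_model, start, end, 0.5, 2))
```

The uniform map carries the same information as a scalar timestep, which is why it can replace time indexing without extra computation; a per-pixel map would instead assign each pixel its own value.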

Paper Structure (39 sections, 24 equations, 12 figures, 12 tables)

Figures (12)

  • Figure 1: Comparison of time indexing and distance indexing training paradigms. (a) Time indexing uses the starting frame $I_0$, ending frame $I_1$, and a scalar variable $t$ as inputs. (b) Distance indexing replaces the scalar with a distance map $D_t$ and optionally incorporates iterative reference-based estimation $\left(I_{ref}, D_{ref}\right)$ to address velocity ambiguity, resulting in a notably sharper prediction.
  • Figure 2: Velocity ambiguity. (a) Speed ambiguity. (b) Directional ambiguity.
  • Figure 3: Disambiguation strategies for velocity ambiguity. (a) Distance indexing. (b) Iterative reference-based estimation.
  • Figure 4: Calculation of distance map for distance indexing. $V_{0\to t}$ is the estimated optical flow from $I_0$ to $I_t$ by RAFT (Teed and Deng, 2020), and $V_{0\to 1}$ is the optical flow from $I_0$ to $I_1$.
  • Figure 5: Multi-frame fusion architecture with continuous map estimator.
  • ...and 7 more figures
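
The caption of Figure 4 names the two optical flows used for the distance map but does not state the formula. One plausible reading, assumed here, is that each pixel's distance index is the projection of $V_{0\to t}$ onto $V_{0\to 1}$ divided by the length of $V_{0\to 1}$. The sketch below follows that assumption; the function names, the `eps` threshold, and the zero value for static pixels are illustrative choices rather than details from the paper.

```python
from typing import List, Tuple

Flow = Tuple[float, float]  # (dx, dy) displacement at one pixel


def distance_index(v0t: Flow, v01: Flow, eps: float = 1e-6) -> float:
    """Fraction of the full motion v01 covered by v0t, measured along v01.

    Equals |v0t| * cos(theta) / |v01|, where theta is the angle between the flows.
    """
    norm_sq = v01[0] ** 2 + v01[1] ** 2
    if norm_sq < eps:
        # Assumption: pixels with (near) zero total motion get index 0.
        return 0.0
    dot = v0t[0] * v01[0] + v0t[1] * v01[1]
    return dot / norm_sq


def distance_map(flow_0t: List[List[Flow]], flow_01: List[List[Flow]]) -> List[List[float]]:
    """Apply distance_index at every pixel of two equally sized flow fields."""
    return [[distance_index(a, b) for a, b in zip(row_t, row_1)]
            for row_t, row_1 in zip(flow_0t, flow_01)]


if __name__ == "__main__":
    flow_0t = [[(1.0, 0.0), (0.0, 1.0)]]
    flow_01 = [[(2.0, 0.0), (2.0, 0.0)]]
    print(distance_map(flow_0t, flow_01))
```

Under this reading, a pixel that has moved halfway along its straight-line path gets an index of 0.5 regardless of how much time has elapsed, which is what decouples the index from the object's speed.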