Generalizable Implicit Motion Modeling for Video Frame Interpolation

Zujin Guo, Wei Li, Chen Change Loy

TL;DR

This work tackles the challenge of modeling complex spatiotemporal motion for video frame interpolation (VFI) by introducing Generalizable Implicit Motion Modeling (GIMM). GIMM encodes motion priors from bidirectional flows through a motion encoder and forward warping to produce an instance-specific motion latent, which conditions an adaptive coordinate-based network that predicts bilateral flows at arbitrary timestamps. The approach can be plugged into existing flow-based VFI pipelines (e.g., AMT) to produce high-quality interpolations, and it achieves state-of-the-art performance on arbitrary-timestep interpolation benchmarks, including Vimeo90K-derived motion-learning tasks and SNU-FILM-arb/XTest. The results indicate that generalizable implicit motion modeling with input-specific priors yields more accurate and coherent motion representations across diverse videos, with practical implications for slow-motion synthesis, video editing, and compression.

Abstract

Motion modeling is critical in flow-based Video Frame Interpolation (VFI). Existing paradigms either consider linear combinations of bidirectional flows or directly predict bilateral flows for given timestamps without exploring favorable motion priors, thus lacking the capability of effectively modeling spatiotemporal dynamics in real-world videos. To address this limitation, in this study, we introduce Generalizable Implicit Motion Modeling (GIMM), a novel and effective approach to motion modeling for VFI. Specifically, to enable GIMM as an effective motion modeling paradigm, we design a motion encoding pipeline to model spatiotemporal motion latent from bidirectional flows extracted from pre-trained flow estimators, effectively representing input-specific motion priors. Then, we implicitly predict arbitrary-timestep optical flows within two adjacent input frames via an adaptive coordinate-based neural network, with spatiotemporal coordinates and motion latent as inputs. Our GIMM can be easily integrated with existing flow-based VFI works by supplying accurately modeled motion. We show that GIMM performs better than the current state of the art on standard VFI benchmarks.
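To make the idea of a coordinate-based flow predictor concrete, the following is a minimal toy sketch, not the authors' architecture. It assumes the motion latent is simply concatenated with the coordinates $(x, y, t)$ and fed to a small MLP with fixed random weights, whereas the abstract describes an adaptive network. The names `make_mlp`, `query_flow`, and `LATENT_DIM`, the sine activation, and all sizes are hypothetical choices for illustration.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Build a tiny two-layer MLP with fixed random weights (illustrative only)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    b2 = [0.0] * out_dim
    return w1, b1, w2, b2


def query_flow(mlp, x, y, t, latent):
    """Predict a 2D flow vector at coordinate (x, y, t), conditioned on a latent.

    Assumption: conditioning is done by concatenation, which is a
    simplification of the adaptive conditioning described in the abstract.
    """
    w1, b1, w2, b2 = mlp
    inputs = [x, y, t] + list(latent)
    hidden = [
        math.sin(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


LATENT_DIM = 4  # hypothetical latent size
mlp = make_mlp(in_dim=3 + LATENT_DIM, hidden_dim=8, out_dim=2)
latent = [0.1, -0.3, 0.5, 0.2]  # stand-in for a per-pixel motion latent

# The same network can be queried at any timestep between the two frames.
for t in (0.25, 0.5, 0.75):
    u, v = query_flow(mlp, x=0.0, y=0.0, t=t, latent=latent)
    print(f"t={t:.2f} -> flow=({u:+.3f}, {v:+.3f})")
```

Because $t$ is an ordinary input, no retraining or separate model is needed per timestep, which is what makes arbitrary-timestep prediction possible in this kind of formulation.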

Paper Structure

This paper contains 21 sections, 14 equations, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: Schematic of motion modeling paradigms in video frame interpolation. (a) A naïve linear combination of bidirectional flows $F_{0\rightarrow1}, F_{1\rightarrow0}$ (i.e., flows between input frames) may lead to ambiguous and coarse motion estimation due to strong overlap and linearity assumptions. (b) A time-condition-based modeling approach may predict suboptimal bilateral flows $F_{t\rightarrow0}, F_{t\rightarrow1}$ (i.e., flows between estimated and input frames), capturing spatiotemporal changes for moving objects ineffectively. (c) Our generalizable implicit motion modeling properly represents spatiotemporal dynamics across videos and predicts better bilateral flows via an adaptive coordinate-based neural network.
  • Figure 2: Our GIMM first transforms the initial bidirectional flows $F_{0\rightarrow1},F_{1\rightarrow0}$ into normalized flows $V_0, V_1$. The motion encoder extracts motion features $K_0, K_1$ from $V_0, V_1$, respectively. $K_0, K_1$ are then forward warped to a given timestep $t$ using the bidirectional flows to obtain the warped features $K_{t\rightarrow0}, K_{t\rightarrow1}$. We pass the warped and initial motion features into a Latent Refiner that outputs the motion latent $L_t$, representing motion information at $t$. Conditioned on $L_t(x,y)$, the coordinate-based network $g_{\theta}$ predicts the corresponding normalized flow $V_t$ from the 3D coordinates $\textbf{x}=(x,y,t)$. For interpolation, $V_t$ is then converted into bilateral flows $F_{t\rightarrow0},F_{t\rightarrow1}$ through denormalization.
  • Figure 3: An overview of the GIMM-VFI architecture. GIMM-VFI employs a pre-trained flow estimator, $\mathcal{E}$, to predict bidirectional flows $(F_{0\rightarrow1}, F_{1\rightarrow0})$ and extract context features $A$ as well as correlation features $C$ from the input frames $(I_0, I_1)$. Given the timestep $t$, a generalizable implicit motion modeling (GIMM) module $\mathcal{G}$ (detailed in Figure 2) takes the bidirectional flows as inputs and predicts bilateral flows $(F_{t\rightarrow0}, F_{t\rightarrow1})$, which are then passed into a frame synthesis module $\mathcal{S}$, together with the extracted features $(A, C)$, to synthesize the target frame $I_t$.
  • Figure 4: Qualitative comparisons of different motion modeling methods on SNU-FILM-arb-Hard. All results are predicted at $t=0.75$, and ground-truth flows are obtained by FlowFormer (Huang et al., 2022).
  • Figure 5: Qualitative comparisons of arbitrary-timestep interpolation on XTest-2K (Sim et al., 2021). Yellow arrows point to regions that highlight the distinct performance of our method.
  • ...and 4 more figures
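
As a point of reference for the linear-combination paradigm in Figure 1(a), here is a minimal per-pixel sketch. It assumes the widely used Super SloMo form of the approximation, which may not be the exact variant depicted in the figure; the function name `linear_bilateral_flows` is hypothetical.

```python
def linear_bilateral_flows(f01, f10, t):
    """Approximate (F_t->0, F_t->1) at one pixel from bidirectional flows.

    Assumption: the Super SloMo-style linear combination. Both input flows
    are read at the same pixel location, and motion is assumed linear in time.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    f_t0 = tuple(-(1 - t) * t * a + t * t * b for a, b in zip(f01, f10))
    f_t1 = tuple((1 - t) ** 2 * a - t * (1 - t) * b for a, b in zip(f01, f10))
    return f_t0, f_t1


# A pixel moving 4 px to the right between frame 0 and frame 1.
f01 = (4.0, 0.0)   # F_0->1 at this pixel
f10 = (-4.0, 0.0)  # F_1->0 at this pixel
f_t0, f_t1 = linear_bilateral_flows(f01, f10, t=0.5)
print(f_t0, f_t1)
```

The sketch makes the two assumptions named in the caption visible: the flows from both frames are combined at one shared pixel location, and the displacement scales linearly with $t$. Either can fail for fast or non-linear motion, which is the gap GIMM is meant to address.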