Sparse Global Matching for Video Frame Interpolation with Large Motion

Chunxu Liu, Guozhen Zhang, Rui Zhao, Limin Wang

TL;DR

This work tackles the difficulty of large-motion video frame interpolation by introducing a two-branch framework that fuses local intermediate-flow estimation with a sparse global matching branch. The method starts with a high-resolution local feature-based estimate of the intermediate flows, then identifies flawed regions via a difference-based mechanism and computes sparse flow compensation using a global receptive field. An adaptive Flow Merge Block fuses local and sparse global information to produce refined intermediate flows, which are further refined to synthesize the target frame. The approach yields state-of-the-art performance on challenging large-motion benchmarks while maintaining strong results on small-to-medium motion data, demonstrating the practical potential of combining local detail with targeted global correspondences for VFI.
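The difference-based flaw detection and the sparse global matching can be made concrete with a minimal NumPy sketch. It is not the authors' implementation, and every function name is hypothetical. It assumes that the flaw score is the photometric difference between the two inputs after backward-warping them to time t with the local flows (one map standing in for the paper's $D_0, D_1$), that warping rounds to the nearest pixel, and that a fixed fraction of pixels is kept (0.25 mirrors the "1/4-Points" setting named in Figure 1). For simplicity the selected coordinates are used directly as queries in frame 0. Each query is matched against every pixel of the other frame with a softmax-weighted correlation (soft-argmax). This is the formulation used by global-matching optical flow methods such as GMFlow, swapped in here for illustration.

```python
import numpy as np


def backward_warp_nearest(img, flow):
    """Sample img at (x + u, y + v), rounding to the nearest pixel.

    img: (H, W, C) array; flow: (H, W, 2) array of (u, v) offsets in pixels.
    Samples falling outside the image are clamped to the border.
    """
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    sx = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    sy = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    return img[sy, sx]


def select_flawed_points(img0, img1, flow_t0, flow_t1, ratio=0.25):
    """Build a difference map and return it with the top-`ratio` pixel coordinates.

    Assumption (not from the paper): a large disagreement between the two
    inputs warped to time t marks a region where the local flows are wrong.
    """
    warped0 = backward_warp_nearest(img0.astype(float), flow_t0)
    warped1 = backward_warp_nearest(img1.astype(float), flow_t1)
    diff = np.abs(warped0 - warped1).mean(axis=-1)  # (H, W) difference map
    k = max(1, int(round(ratio * diff.size)))
    idx = np.argsort(diff, axis=None)[::-1][:k]  # flat indices of the k largest
    ys, xs = np.unravel_index(idx, diff.shape)
    return diff, ys, xs


def sparse_global_match(feat0, feat1, ys, xs, temperature=1.0):
    """Match each selected pixel of feat0 against every pixel of feat1.

    feat0, feat1: (H, W, C) feature maps. Returns a (k, 2) array holding the
    correspondence (u, v) from frame 0 to frame 1 at the selected points,
    computed as the softmax-weighted mean position (soft-argmax).
    """
    h, w, c = feat1.shape
    queries = feat0[ys, xs]                      # (k, C) sparse queries
    keys = feat1.reshape(-1, c)                  # (H*W, C): global receptive field
    scores = queries @ keys.T / (np.sqrt(c) * temperature)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    prob = np.exp(scores)
    prob /= prob.sum(axis=1, keepdims=True)
    gy, gx = np.mgrid[0:h, 0:w]
    coords = np.stack([gx.ravel(), gy.ravel()], axis=1).astype(float)  # (x, y)
    matched = prob @ coords                      # expected match location
    return matched - np.stack([xs, ys], axis=1)  # displacement (u, v)
```

A typical call chain is `diff, ys, xs = select_flawed_points(img0, img1, flow_t0, flow_t1)` followed by `f01 = sparse_global_match(feat0, feat1, ys, xs)`. In this sketch the matching cost grows with k·H·W, so restricting the queries to the flawed fraction is what keeps a global receptive field affordable.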

Abstract

Large motion poses a critical challenge in the Video Frame Interpolation (VFI) task. Existing methods are often constrained by limited receptive fields, resulting in sub-optimal performance when handling scenarios with large motion. In this paper, we introduce a new pipeline for VFI, which can effectively integrate global-level information to alleviate issues associated with large motion. Specifically, we first estimate a pair of initial intermediate flows using a high-resolution feature map for extracting local details. Then, we incorporate a sparse global matching branch to compensate for flow estimation, which identifies flaws in the initial flows and generates sparse flow compensation with a global receptive field. Finally, we adaptively merge the initial flow estimation with global flow compensation, yielding a more accurate intermediate flow. To evaluate the effectiveness of our method in handling large motion, we carefully curate more challenging subsets from commonly used benchmarks. Our method demonstrates state-of-the-art performance on these VFI subsets with large motion.
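Figure 3 describes keeping the most challenging half of Xiph and SNU-FILM. A minimal sketch of such a split follows, assuming difficulty is ranked by mean optical-flow magnitude. This summary does not state the paper's exact criterion; Figure 4 only shows that flow magnitude is measured with RAFT. The flow fields are taken as given, and `mean_flow_magnitude` and `hardest_half` are hypothetical names.

```python
import numpy as np


def mean_flow_magnitude(flow):
    """Mean per-pixel Euclidean length of an (H, W, 2) optical-flow field."""
    return float(np.linalg.norm(flow, axis=-1).mean())


def hardest_half(samples, flows):
    """Keep the half of `samples` with the largest mean flow magnitude.

    samples: list of sample identifiers; flows: matching list of (H, W, 2)
    arrays, e.g. from an off-the-shelf estimator (assumed, not shown here).
    Ranking by mean magnitude is an assumption of this sketch.
    """
    scores = [mean_flow_magnitude(f) for f in flows]
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    keep = order[: len(samples) // 2]
    return [samples[i] for i in sorted(keep)]  # preserve the original order
```

Ranking by the mean rather than the maximum magnitude favours clips whose motion is large over much of the frame. Other statistics, such as a high percentile, would be equally plausible choices.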

Paper Structure (32 sections, 9 equations, 7 figures, 13 tables)

Figures (7)

  • Figure 1: (a) Our framework without sparse global matching, pretrained on a small-motion dataset, for capturing local details. (b) Our framework with sparse global matching, fine-tuned on a large-motion dataset, for capturing global large motion. (c) Key components in our algorithm, illustrating the effect of our sparse global matching branch. (Using Ours-1/4-Points from the main results table.)
  • Figure 2: Overview of our proposed structure. First, a local feature extractor extracts local features for estimating the initial intermediate flows $\widetilde{F}_{t\rightarrow 0}, \widetilde{F}_{t\rightarrow 1}$. Then, our sparse global matching branch locates flaws by constructing difference maps $D_0, D_1$. Next, we perform sparse global matching using global features extracted by a global feature extractor. Finally, after shifting the global correspondences $f_{0\rightarrow 1}, f_{1\rightarrow 0}$ to the intermediate flow compensation $f_{t\rightarrow 1}, f_{t\rightarrow 0}$, we adaptively merge $\widetilde{F}_{t\rightarrow 0}, \widetilde{F}_{t\rightarrow 1}$ with $f_{t\rightarrow 0}, f_{t\rightarrow 1}$ and apply residual flow refinement to interpolate the intermediate frame.
  • Figure 3: Large-motion dataset benchmark analysis. Top: whole dataset. Bottom: only the most challenging half of Xiph and SNU-FILM. All four charts share the same legend.
  • Figure 4: Visual comparison with different methods on instances selected from X-Test-L (XVFI). The optical flow magnitude, measured by RAFT, is shown on the left. The four sparsity settings of our method are shown in the red frame. Blue frames emphasize large motion, while green frames focus more on local details. Best viewed zoomed in.
  • Figure A5: Model structure of the local feature branch. $a^i_0, a^i_1, i\in \{0, 1, 2, 3\}$ are the extracted local features, with feature resolutions $\{H\times W, H/2\times W/2, H/4\times W/4, H/8\times W/8\}$, respectively.
  • ...and 2 more figures
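The last two steps in Figure 2 can be sketched as follows: shifting the sparse global correspondences to the intermediate time, then merging them with the local flows. This is a hedged illustration under a linear-motion assumption, not the paper's code. A point at $p$ in frame 0 with correspondence $f_{0\rightarrow 1}$ sits at $p + t\,f_{0\rightarrow 1}$ at time $t$, so its flow to frame 1 is $(1-t)\,f_{0\rightarrow 1}$. Symmetrically, a point of frame 1 sits at $p + (1-t)\,f_{1\rightarrow 0}$, and its flow to frame 0 is $t\,f_{1\rightarrow 0}$. A given per-pixel weight map stands in for the learned Flow Merge Block, and the residual refinement is omitted. `shift_sparse_flow` and `merge_flows` are hypothetical names.

```python
import numpy as np


def shift_sparse_flow(ys, xs, flow, t_pos, scale, h, w):
    """Move sparse correspondences to their (rounded) position at time t.

    ys, xs: (k,) pixel coordinates of the sparse points in the source frame.
    flow:   (k, 2) correspondence (u, v) from the source frame to the other one.
    t_pos:  fraction of `flow` travelled by time t (t from frame 0, 1 - t from frame 1).
    scale:  factor turning `flow` into the flow from time t to the target frame.
    Returns a dense (h, w, 2) compensation field and an (h, w) validity mask.
    """
    comp = np.zeros((h, w, 2))
    mask = np.zeros((h, w), dtype=bool)
    qx = np.clip(np.rint(xs + t_pos * flow[:, 0]), 0, w - 1).astype(int)
    qy = np.clip(np.rint(ys + t_pos * flow[:, 1]), 0, h - 1).astype(int)
    # If two points land on the same pixel, NumPy does not specify which write wins.
    comp[qy, qx] = scale * flow
    mask[qy, qx] = True
    return comp, mask


def merge_flows(local_flow, comp, mask, weight):
    """Blend local and global flow, only where global compensation exists.

    weight: (h, w) values in [0, 1]; a stand-in for the learned merge weights.
    """
    w3 = (weight * mask)[..., None]
    return w3 * comp + (1.0 - w3) * local_flow
```

For frame 0, the call would be `shift_sparse_flow(ys0, xs0, f01, t_pos=t, scale=1 - t, h=H, w=W)` to obtain the $f_{t\rightarrow 1}$ compensation. For frame 1, it would be `shift_sparse_flow(ys1, xs1, f10, t_pos=1 - t, scale=t, h=H, w=W)` to obtain $f_{t\rightarrow 0}$. Outside the mask, the merged flow falls back to the local estimate.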