Restoration-Oriented Video Frame Interpolation with Region-Distinguishable Priors from SAM

Yan Han, Xiaogang Xu, Yingqi Lin, Jiafei Wu, Zhe Liu, Ming-Hsuan Yang

TL;DR

This work tackles restoration-oriented Video Frame Interpolation by addressing region-level motion ambiguity. It introduces Region-Distinguishable Priors (RDPs) derived from SAM2 and a Hierarchical Region-aware Feature Fusion Module (HRFFM) to augment VFI encoders, enabling more consistent region representations across adjacent frames. The RDPs are crafted as spatially varying Gaussian mixtures via a Gaussian embedding of SAM2 masks, and are fused through RDP-guided Feature Normalization to improve motion estimation. Extensive experiments across multiple datasets and baselines demonstrate consistent improvements with minimal overhead, highlighting the method's practical potential for enhancing interpolation fidelity and boundary sharpness in diverse scenes.

Abstract

In existing restoration-oriented Video Frame Interpolation (VFI) approaches, motion estimation between neighboring frames plays a crucial role. However, estimation accuracy remains a challenge, primarily due to the inherent ambiguity in identifying corresponding areas in adjacent frames. Distinguishing different regions before motion estimation is therefore important for improving accuracy. In this paper, we use open-world segmentation models, e.g., SAM2 (Segment Anything Model 2), to derive Region-Distinguishable Priors (RDPs) for different frames. These RDPs are represented as spatially varying Gaussian mixtures, distinguishing an arbitrary number of areas with a unified modality. RDPs can be integrated into existing motion-based VFI methods to enhance features for motion estimation through our plug-and-play Hierarchical Region-aware Feature Fusion Module (HRFFM). HRFFM incorporates RDPs into multiple hierarchical stages of the VFI encoder, using RDP-guided Feature Normalization (RDPFN) in a residual learning manner. With HRFFM and RDPs, the features within the VFI encoder exhibit similar representations for matched regions in neighboring frames, thus improving the synthesis of intermediate frames. Extensive experiments demonstrate that HRFFM consistently enhances VFI performance across various scenes.
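
The abstract does not spell out how a segmentation mask becomes a spatially varying Gaussian mixture, so the following is only a minimal sketch of one plausible reading. It assumes the segmentation is already available as an integer label map (a toy array stands in for SAM2 output here), draws one mean vector per region, and samples each pixel from an isotropic Gaussian around its region's mean. The function name `gaussian_region_prior` and the parameters `dim`, `sigma`, and `seed` are hypothetical and do not come from the paper.

```python
import numpy as np

def gaussian_region_prior(labels, dim=4, sigma=0.1, seed=0):
    """Embed an integer label map (H, W) as a (dim, H, W) prior.

    Each region id gets its own randomly drawn mean vector, and every pixel
    in that region is sampled from an isotropic Gaussian around that mean,
    so the full map is a spatially varying Gaussian mixture.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    region_ids = np.unique(labels)
    # One mean vector per region (assumption: standard normal draws).
    means = rng.standard_normal((region_ids.size, dim))
    # Map each pixel to the row of `means` that belongs to its region.
    index = np.searchsorted(region_ids, labels)
    per_pixel_mean = means[index]                       # (H, W, dim)
    noise = sigma * rng.standard_normal(per_pixel_mean.shape)
    return np.moveaxis(per_pixel_mean + noise, -1, 0)   # (dim, H, W)

# Toy label map with three regions, standing in for a SAM2 segmentation.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 2, 2]])
prior = gaussian_region_prior(labels)
print(prior.shape)  # (4, 3, 4)
```

Because the output has a fixed channel count regardless of how many regions the label map contains, this kind of embedding handles an arbitrary number of areas with a single representation, which is the property the abstract attributes to RDPs.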

Paper Structure (19 sections, 9 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: First two columns: overlaid inputs and the ground-truth frame. Middle two columns: motion field (from the first to the second frame) estimated by VFIformer [Alpher19] and the corresponding interpolation. Last two columns: motion field and interpolated frame obtained by enhancing VFIformer with our RDP-based strategy. Our approach yields better motion estimation and thus better interpolation results.
  • Figure 2: The standard framework of motion-based VFI. It consists of three stages: extracting image features with the encoder, estimating optical flow, and warping the features and decoding them in a frame synthesis module to generate the intermediate frame. Our proposed HRFFM incorporates the prior RDP $S_i$ into the hierarchical stages of the encoder.
  • Figure 3: Overview of HRFFM, which first exploits RDPs to enhance image features via RDPFN (Eq. \ref{eq:fuse}) and then refines them (Eq. \ref{eq:residual}). $f_{i,l}$ and $s_{i,l}$ are the image feature and RDP feature of the $i$-th frame at the $l$-th layer, respectively. $\oplus$ denotes concatenation.
  • Figure 4: Overview of RDPFN. It takes both RDP features and image features as inputs and combines long- and short-range operations to extract informative features, from which it predicts region-aware normalization parameters. This ensures that features within the same instance are similar, which benefits the subsequent modules. $\oplus$ denotes concatenation, and $\otimes$ denotes the dot product.
  • Figure 5: Visual comparison on SNU-FILM [ChannelM2020]. The three rows, from top to bottom, show the comparison results for VFIformer, UPR-Net, and M2M-PWC. The highlighted boxes mark regions where our model performs better.
  • ...and 5 more figures
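
The idea of prior-guided normalization described in the Figure 4 caption can be illustrated with a rough sketch. It is not the paper's RDPFN: the caption says the real module predicts its parameters with learned long- and short-range operations, whereas this sketch substitutes a fixed per-pixel linear map (the hypothetical weights `w_gamma` and `w_beta`) and NumPy arrays for network features. The function name `prior_guided_norm` is likewise an assumption.

```python
import numpy as np

def prior_guided_norm(feat, prior, w_gamma, w_beta, eps=1e-5):
    """Modulate normalized features with parameters derived from a prior.

    feat:    (C, H, W) image features
    prior:   (D, H, W) region prior features
    w_gamma: (C, D) weights mapping the prior to a per-pixel scale
    w_beta:  (C, D) weights mapping the prior to a per-pixel shift
    Returns feat plus the modulated, normalized features (residual form).
    """
    # Normalize each channel over its spatial positions.
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True)
    normed = (feat - mean) / (std + eps)
    # Per-pixel scale and shift from the prior (stand-in for learned layers).
    gamma = np.einsum("cd,dhw->chw", w_gamma, prior)
    beta = np.einsum("cd,dhw->chw", w_beta, prior)
    return feat + (1.0 + gamma) * normed + beta

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 3, 4))
prior = rng.standard_normal((2, 3, 4))
w_gamma = 0.1 * rng.standard_normal((8, 2))
w_beta = 0.1 * rng.standard_normal((8, 2))
out = prior_guided_norm(feat, prior, w_gamma, w_beta)
print(out.shape)  # (8, 3, 4)
```

Since the scale and shift depend only on the prior at each pixel, pixels that share a prior value receive the same modulation, which is the mechanism by which a region prior could pull features of the same region toward similar representations.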