SG-RIFE: Semantic-Guided Real-Time Intermediate Flow Estimation with Diffusion-Competitive Perceptual Quality

Pan Ben Wong, Chengli Wu, Hanyue Lu

TL;DR

SG-RIFE tackles the real-time VFI trade-off by injecting dense semantic priors from a frozen DINOv3 backbone into a RIFE-based flow interpolation framework. It introduces Split-FAPM to compress semantic features and DSF to align them with motion-driven pixel contexts, enabling high perceptual fidelity without the latency of diffusion models. The approach yields diffusion-competitive FID/LPIPS on SNU-FILM while running in near real time and using a small trainable parameter budget. This work highlights the practical potential of semantic priors to overcome flow-based limitations in complex motion, offering a scalable path to high-quality, low-latency VFI.

Abstract

Real-time Video Frame Interpolation (VFI) has long been dominated by flow-based methods like RIFE, which offer high throughput but often fail in complicated scenarios involving large motion and occlusion. Conversely, recent diffusion-based approaches (e.g., Consec. BB) achieve state-of-the-art perceptual quality but suffer from prohibitive latency, rendering them impractical for real-time applications. To bridge this gap, we propose Semantic-Guided RIFE (SG-RIFE). Instead of training from scratch, we introduce a parameter-efficient fine-tuning strategy that augments a pre-trained RIFE backbone with semantic priors from a frozen DINOv3 Vision Transformer. We propose a Split-Fidelity Aware Projection Module (Split-FAPM) to compress and refine high-dimensional features, and a Deformable Semantic Fusion (DSF) module to align these semantic priors with pixel-level motion fields. Experiments on SNU-FILM demonstrate that semantic injection provides a decisive boost in perceptual fidelity. SG-RIFE outperforms diffusion-based LDMVFI in FID/LPIPS and achieves quality comparable to Consec. BB on complex benchmarks while running significantly faster, proving that semantic consistency enables flow-based methods to achieve diffusion-competitive perceptual quality in near real-time.
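
To make the idea of aligning semantic features with a motion field more concrete, here is a minimal NumPy sketch of backward warping a feature map with a flow field. It is an illustration under stated assumptions, not the authors' implementation: the function name `backward_warp`, the `(C, H, W)` layout, the flow convention (x displacement first, in units of feature cells), and border clamping are all choices made here. It also assumes the flow has already been resized and rescaled to the feature grid. The learned DSF module is described as correcting the misalignment that remains after this kind of coarse warp, and that correction is not modeled here.

```python
import numpy as np

def backward_warp(feat, flow):
    """Bilinearly sample feat (C, H, W) at positions displaced by flow (2, H, W).

    Assumed convention: flow[0] is the x displacement and flow[1] the y
    displacement, both measured in feature-map cells.
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Source coordinates, clamped to the border so every sample is defined.
    sx = np.clip(xs + flow[0], 0, w - 1)
    sy = np.clip(ys + flow[1], 0, h - 1)

    # Integer corners surrounding each source coordinate.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional parts act as bilinear interpolation weights.
    wx = sx - x0
    wy = sy - y0

    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy
```

With a zero flow the function returns the input unchanged. With a constant flow of one cell in x, each output column takes the value of its right-hand neighbor, and the last column repeats because of the border clamp.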

Paper Structure

This paper contains 14 sections, 6 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Semantic Stability Analysis. DINOv3 features for two consecutive frames using a shared PCA basis. Despite large displacement between $I_0$ (Left) and $I_1$ (Right), the semantic representations of moving objects (e.g., the pedestrians) have consistent color signatures. This stability allows SG-RIFE to maintain object identity where traditional flow-based context typically degrades.
  • Figure 2: Overview of SG-RIFE. We extract semantic features from a frozen DINOv3 backbone. These features are compressed via the Split-FAPM, warped using RIFE's intermediate flow, and aligned via Deformable Semantic Fusion (DSF) before being injected into the FusionNet bottleneck.
  • Figure 3: Hierarchical Feature Selection. Visualization of principal components from DINOv3 layers. Layer 8 (Left) exhibits high-frequency variance corresponding to local textural patterns and edges. Layer 11 (Right) demonstrates semantic stability, providing semantically coherent guidance that helps the FusionNet preserve global object boundaries.
  • Figure 4: Visualization of the Flow-Guided Deformable Alignment (DSF) Mechanism. Left: Coarsely warped semantic features exhibit misalignment due to optical flow errors. Center: The offset magnitude maps ($||\Delta p||$) reveal the active alignment regions. Note the high activation (red) on the dynamic foreground subjects, indicating that the module is performing correction to compensate for flow inaccuracies. Right: The final bi-directional fusion result demonstrates seamless integration of the corrected features.
  • Figure 5: Qualitative Comparison on the SNU-FILM (Extreme) dataset. (a) The overlaid input frames ($I_0$ and $I_1$) illustrate the large motion. (c) The baseline RIFE suffers from ghosting. (d) Our SG-RIFE successfully refines high-frequency textures and mitigates ghosting.
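
The shared PCA basis mentioned for Figure 1 can be sketched as follows. This is a hedged illustration rather than the authors' visualization code: the function name `shared_pca_rgb`, the `(N, D)` token layout, and the joint min-max rescaling are assumptions made here. The point it shows is that fitting one basis on the pooled tokens of both frames makes colors comparable across frames, so a token that reappears at a different position keeps the same color.

```python
import numpy as np

def shared_pca_rgb(feat0, feat1, n_components=3):
    """Project two (N, D) token-feature arrays onto one PCA basis fitted on both.

    Returns two (N, n_components) arrays, rescaled jointly to [0, 1] so they
    can be shown as colors on a common scale.
    """
    stacked = np.concatenate([feat0, feat1], axis=0)
    mean = stacked.mean(axis=0, keepdims=True)
    centered = stacked - mean

    # Rows of vt are the principal directions of the pooled, centered tokens.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components].T

    # Project both frames with the same basis, then rescale per component.
    proj = centered @ basis
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / np.maximum(hi - lo, 1e-8)

    n0 = len(feat0)
    return proj[:n0], proj[n0:]
```

For display, each returned array would be reshaped to the patch grid of its frame (for example `(H_p, W_p, 3)`), where the grid size depends on the backbone's patch size and the input resolution.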