Disentangled Motion Modeling for Video Frame Interpolation

Jaihyun Lew, Jooyoung Choi, Chaehun Shin, Dahuin Jung, Sungroh Yoon

TL;DR

Disentangled Motion Modeling for Video Frame Interpolation introduces MoMo, a diffusion-based approach that concentrates on modeling intermediate motion (bi-directional optical flows) rather than direct pixel-space generation to boost perceptual quality. The method uses a two-stage training regime: first, train a frame synthesis network and a teacher optical-flow model; second, train a motion diffusion model that produces intermediate flows used by the synthesizer, aided by a lightweight architecture optimized for flow. A low-resolution diffusion strategy with convex upsampling and no attention accelerates motion prediction, enabling high-quality interpolation with substantially reduced compute compared to pixel-space diffusion baselines. Empirical results across Vimeo90k, SNU-FILM, Middlebury, and Xiph demonstrate state-of-the-art perceptual metrics (LPIPS, DISTS) and strong qualitative performance, with orders-of-magnitude faster runtimes than existing diffusion-based VFI methods. This work advances perceptual VFI by decoupling motion generation from pixel synthesis and by tailoring diffusion models to the structured nature of optical flow, offering practical gains in both quality and efficiency.

Abstract

Video Frame Interpolation (VFI) aims to synthesize intermediate frames between existing frames to enhance visual smoothness and quality. Beyond the conventional methods based on the reconstruction loss, recent works have employed generative models for improved perceptual quality. However, they require complex training and large computational costs for pixel space modeling. In this paper, we introduce disentangled Motion Modeling (MoMo), a diffusion-based approach for VFI that enhances visual quality by focusing on intermediate motion modeling. We propose a disentangled two-stage training process. In the initial stage, frame synthesis and flow models are trained to generate accurate frames and flows optimal for synthesis. In the subsequent stage, we introduce a motion diffusion model, which incorporates our novel U-Net architecture specifically designed for optical flow, to generate bi-directional flows between frames. By learning the simpler low-frequency representation of motions, MoMo achieves superior perceptual quality with reduced computational demands compared to the generative modeling methods on the pixel space. MoMo surpasses state-of-the-art methods in perceptual metrics across various benchmarks, demonstrating its efficacy and efficiency in VFI.
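
To make the role of the bi-directional flows concrete, here is a minimal NumPy sketch of how a pair of intermediate flows could be used to bring both input frames to the target time. It rests on assumptions not stated in the abstract: the flows are taken to point from the intermediate time to each input frame (so backward warping applies), and the plain average in `interpolate_midframe` is only a hypothetical stand-in for MoMo's learned frame synthesis network. The names `backward_warp`, `interpolate_midframe`, `flow_t0`, and `flow_t1` are illustrative.

```python
import numpy as np


def backward_warp(frame, flow):
    """Sample `frame` (H, W, C) at positions displaced by `flow` (2, H, W).

    flow[0] is the horizontal (x) displacement and flow[1] the vertical (y)
    displacement, both in pixels. Sampling is bilinear, and coordinates are
    clamped to the image border.
    """
    H, W, _ = frame.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x = np.clip(xs + flow[0], 0, W - 1)
    y = np.clip(ys + flow[1], 0, H - 1)

    # Integer corners surrounding each sampling position.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)

    # Fractional offsets, broadcast over the channel axis.
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]

    top = (1 - wx) * frame[y0, x0] + wx * frame[y0, x1]
    bottom = (1 - wx) * frame[y1, x0] + wx * frame[y1, x1]
    return (1 - wy) * top + wy * bottom


def interpolate_midframe(frame0, frame1, flow_t0, flow_t1):
    """Hypothetical stand-in for the synthesis step (assumption).

    Warps both inputs to the intermediate time and averages them. The paper
    uses a learned synthesis network here instead of a fixed average.
    """
    warped0 = backward_warp(frame0, flow_t0)
    warped1 = backward_warp(frame1, flow_t1)
    return 0.5 * (warped0 + warped1)
```

With zero flow the warp returns the input frame unchanged, which is a quick sanity check on the sampling convention. In the described framework, the motion diffusion model would supply `flow_t0` and `flow_t1`, and the quality of the output depends largely on how well-structured those flows are.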

Paper Structure

This paper contains 39 sections, 14 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: Video frame interpolation results of our proposed method called MoMo with comparison to state-of-the-art methods. MoMo produces the most visually pleasant result, owing to proper modeling of the intermediate motion.
  • Figure 2: Overview of our entire framework. The training procedure operates in two stages. Initially, we train a frame synthesis network and an optical flow model, with the latter providing pseudo-labels for the second stage. In the second stage of training, we focus on training a Motion Diffusion Model to predict bi-directional flow between frames. During inference, the Motion Diffusion Model generates flow fields given the input frame pair, which the frame synthesis model uses to generate the output.
  • Figure 3: Architecture of our motion diffusion model. The input frame pair is downsampled to an $8\times$ smaller size and passed through a 3-level U-Net, which outputs a pair of coarse flow maps and their corresponding weight masks for upsampling. The convex upsampling layer takes the coarse flow maps and weight masks and returns the full-resolution flow maps.
  • Figure 4: Visualized comparison of estimated intermediate flows against state-of-the-art methods. Our estimated flow fields are better structured, which leads to promising frame synthesis.
  • Figure 5: Qualitative comparison against state-of-the-art methods on the 'extreme' subset of SNU-FILM and on Xiph-4K. Our results show the fewest artifacts and well-structured images.
  • ...and 2 more figures
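
The convex upsampling step mentioned in the Figure 3 caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: each full-resolution flow vector is taken to be a softmax-weighted (hence convex) combination of the 3×3 coarse neighbors, the coarse flow is scaled by the upsampling factor to convert it to full-resolution pixel units, and borders are edge-padded. The mask layout `(9, factor, factor, h, w)` and the name `convex_upsample` are assumptions made for illustration.

```python
import numpy as np


def convex_upsample(flow, mask_logits, factor=8):
    """Upsample a coarse flow map with per-pixel convex combinations.

    flow:        (2, h, w) coarse flow, in coarse-pixel units
    mask_logits: (9, factor, factor, h, w) unnormalized weights (assumed layout)
    returns:     (2, factor * h, factor * w) flow in full-resolution pixel units
    """
    c, h, w = flow.shape

    # Softmax over the 9 neighbors: weights are non-negative and sum to 1.
    shifted = mask_logits - mask_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)

    # Gather each coarse pixel's 3x3 neighborhood (edge-padded, an assumption),
    # scaling displacements to full-resolution units.
    padded = np.pad(factor * flow, ((0, 0), (1, 1), (1, 1)), mode="edge")
    neighbors = np.stack(
        [padded[:, dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)],
        axis=1,
    )  # shape (2, 9, h, w)

    # Weighted sum over the 9 neighbors for every sub-pixel position (i, j).
    up = np.einsum("kijhw,ckhw->chiwj", weights, neighbors)  # (2, h, f, w, f)
    return up.reshape(c, h * factor, w * factor)
```

Because the weights form a convex combination, every upsampled value stays within the range of its coarse neighbors, so a constant coarse flow upsamples to a constant full-resolution flow regardless of the mask values. Predicting the masks at coarse resolution is what lets the diffusion model itself operate on the $8\times$ smaller input.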