MotionBridge: Dynamic Video Inbetweening with Flexible Controls

Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao

TL;DR

The paper addresses controllable video inbetweening for large, multi-modal motions by introducing MotionBridge, a DiT-based framework with dual-branch encoders for content and motion and two dedicated generators (Sparse Motion Generator and Augmented Frame Generator). A curriculum learning strategy enables progressive incorporation of controls, improving motion fidelity and contextual accuracy while remaining backbone-agnostic. Comprehensive experiments demonstrate strong qualitative and quantitative performance, generalization to different backbones, and useful applications such as looping video and image animation. The work enhances controllable video synthesis and paves the way for tighter integration with text-to-video and image-to-video pipelines, with potential extensions to 3D transformations.

Abstract

By generating plausible and smooth transitions between two image frames, video inbetweening is an essential tool for video editing and long video synthesis. Traditional methods cannot generate complex, large motions. While recent video generation techniques produce high-quality results, they often lack fine control over the details of intermediate frames, which can lead to results that do not match the user's creative intent. We introduce MotionBridge, a unified video inbetweening framework that allows flexible controls, including trajectory strokes, keyframes, masks, guide pixels, and text. However, learning such multi-modal controls in a unified framework is a challenging task. We thus design two generators to extract the control signals faithfully and encode features through dual-branch embedders to resolve ambiguities. We further introduce a curriculum training strategy to smoothly learn the various controls. Extensive qualitative and quantitative experiments demonstrate that such multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.

Paper Structure

This paper contains 19 sections, 12 figures, 1 table.

Figures (12)

  • Figure 1: MotionBridge generates smooth and plausible transitions between two RGB images following user-defined trajectories, producing large and intricate motions (see the top two rows for diverse results for the same dog). It offers multi-object control with motions varying between objects (bee + flowers), as well as mask control specifying static (red) vs. dynamic (blue) regions; see the last two rows. In the last row, the static mask helps keep the lady in the same position while turning her body naturally.
  • Figure 2: Overview of our MotionBridge pipeline. Given a video $X$, we propose a Sparse Motion Generator to provide conditioning for the motion trajectory with sparse RGB point controls, and an Augmented Frame Generator to compute guiding pixels that provide fine-grained control. The two kinds of control signals are encoded through the respective branches of the dual-branch embedders to capture accurate content and motion features. At inference, our model flexibly accepts multi-modal controls for interpolation.
  • Figure 3: Structure of Sparse Motion Generator. The input video $X$ is processed through an optical flow generator to extract trajectories. These are then filtered with a Gaussian filter and converted to images to create sparse RGB point controls.
  • Figure 4: Our results. MotionBridge seamlessly integrates motion trajectories and two input frames, enabling smooth transitions. Additionally, we present an example using a mask to control the complete object movement in the first row. By further specifying different input prompts, the results are adapted accordingly, as shown in the last two rows.
  • Figure 5: Qualitative comparisons. Our model generates better interpolation results with fewer artifacts.
  • ...and 7 more figures
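
The Figure 3 caption describes turning trajectories into images of sparse RGB point controls. The following is a minimal sketch of that last rendering step only, not the authors' implementation. It assumes the trajectories have already been extracted (the optical flow stage is omitted), assumes that each tracked point is drawn as a Gaussian-weighted blob of its own color, and uses a hypothetical function name `render_sparse_point_controls` and a hypothetical `sigma` parameter; none of these details are confirmed by the source.

```python
import numpy as np


def render_sparse_point_controls(trajectories, colors, height, width, sigma=2.0):
    """Render point trajectories as per-frame RGB control images.

    Hypothetical helper (assumption, not from the paper).
    trajectories: array of shape (N, T, 2) holding (x, y) pixel positions
                  of N tracked points over T frames.
    colors:       array of shape (N, 3) with RGB values in [0, 1].
    Returns an array of shape (T, height, width, 3).
    """
    trajectories = np.asarray(trajectories, dtype=np.float64)
    colors = np.asarray(colors, dtype=np.float64)
    num_points, num_frames, _ = trajectories.shape

    # Pixel coordinate grids, each of shape (height, width).
    ys, xs = np.mgrid[0:height, 0:width]
    frames = np.zeros((num_frames, height, width, 3))

    for t in range(num_frames):
        for n in range(num_points):
            x, y = trajectories[n, t]
            # Gaussian falloff around the point; peak value 1 at its center.
            weight = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
            # Keep the strongest contribution per pixel so blobs do not sum past 1.
            frames[t] = np.maximum(frames[t], weight[..., None] * colors[n])
    return frames


# Usage: one red point moving horizontally from x=2 to x=5 on an 8x8 canvas.
trajs = np.array([[[2.0, 3.0], [5.0, 3.0]]])   # shape (1, 2, 2)
cols = np.array([[1.0, 0.0, 0.0]])             # pure red
controls = render_sparse_point_controls(trajs, cols, height=8, width=8, sigma=1.0)
```

Taking the per-pixel maximum rather than a sum is a design choice of this sketch: it keeps values in [0, 1] when blobs from nearby points overlap.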