MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, Yinqiang Zheng

TL;DR

MOFA-Video introduces domain-aware Generative Motion Field Adapters (MOFA-Adapters) that convert sparse control signals (trajectories, facial landmarks, etc.) into dense motion fields to steer video generation from a single image within a frozen Stable Video Diffusion model. The MOFA-Adapter architecture—Sparse-to-Dense Motion Generator, Reference Encoder, and Fusion Encoder—warps multi-scale image features and injects them into the diffusion backbone, enabling cross-domain, zero-shot combination of controls. Training occurs per domain on Stable Video Diffusion, using sparse hints from optical flow or landmarks and a distillation-style loss to align downstream latent representations. Inference supports varied signals (trajectory, facial-keypoint-based, motion brushes) and longer videos via periodic sampling, with the capability to fuse multiple MOFA-Adapters for complex, multi-modal animation. Overall, MOFA-Video delivers a flexible, unified framework for controllable, open-world image animation with improved temporal consistency and user control.

Abstract

We present MOFA-Video, an advanced controllable image animation method that generates video from a given image using various additional control signals (such as human landmark references, manual trajectories, or even another provided video) or their combinations. This differs from previous methods, which either work only on a specific motion domain or show weak control abilities with a diffusion prior. To achieve our goal, we design several domain-aware motion field adapters (i.e., MOFA-Adapters) to control the generated motions in the video generation pipeline. For MOFA-Adapters, we consider the temporal motion consistency of the video: we first generate dense motion flow from the given sparse control conditions, and then warp the multi-scale features of the given image to serve as guidance features for Stable Video Diffusion generation. We simply train two motion adapters, one for manual trajectories and one for human landmarks, since both contain sparse information about the control. After training, MOFA-Adapters from different domains can also work together for more controllable video generation. Project Page: https://myniuuu.github.io/MOFA_Video/
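
The abstract describes warping multi-scale image features with a dense motion field. As a rough illustration only, the sketch below backward-warps a single-channel feature map with a per-pixel flow using bilinear sampling, and rescales the flow for a coarser feature level. The function names (`bilinear_sample`, `backward_warp`, `downscale_flow`) are hypothetical, and the choice of backward warping with border clamping is an assumption: this page does not say which warping operator the authors use.

```python
from typing import List

Grid = List[List[float]]  # a single-channel 2D map stored as rows


def bilinear_sample(feat: Grid, x: float, y: float) -> float:
    """Sample feat at continuous (x, y), clamping coordinates to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = feat[y0][x0] * (1 - ax) + feat[y0][x1] * ax
    bottom = feat[y1][x0] * (1 - ax) + feat[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def backward_warp(feat: Grid, flow_x: Grid, flow_y: Grid) -> Grid:
    """Assumed operator: out[y][x] = feat(x + flow_x[y][x], y + flow_y[y][x])."""
    h, w = len(feat), len(feat[0])
    return [
        [bilinear_sample(feat, x + flow_x[y][x], y + flow_y[y][x]) for x in range(w)]
        for y in range(h)
    ]


def downscale_flow(flow: Grid, factor: int) -> Grid:
    """Subsample one flow component by `factor` and rescale its magnitude,
    so the same motion can be applied to a coarser feature level."""
    return [[v / factor for v in row[::factor]] for row in flow[::factor]]


if __name__ == "__main__":
    feat = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]
    flow_x = [[1.0] * 4 for _ in range(2)]  # every pixel samples one step to the right
    flow_y = [[0.0] * 4 for _ in range(2)]
    print(backward_warp(feat, flow_x, flow_y))
```

In a multi-scale setting, each feature level would be warped with a correspondingly downscaled copy of the same flow; how the warped features are then fused into the diffusion backbone is outside the scope of this sketch.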

Paper Structure

This paper contains 25 sections, 4 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: We present MOFA-Video for controllable image animation. We train MOFA-Adapters for (a) manual trajectory animation and (b) facial landmark sequence animation (SadTalker [zhang2023sadtalker] is used for audio-to-landmark generation). These two adapters can be combined in a zero-shot manner for (c) animation from both trajectories and human landmarks without retraining.
  • Figure 2: Overview of MOFA-Video. We design MOFA-Adapters that adapt motions from different domains with a unified structure on top of the frozen video diffusion model. MOFA-Video generates a video from a single image and the corresponding sparse motion hints. For training, we generate the sparse motion hints through sparse motion sampling and then train different MOFA-Adapters to generate video via the pre-trained SVD [svd].
  • Figure 3: Detailed structure of the MOFA-Adapter. It contains an S2D network that accepts the motion hints and produces a dense motion field for the video; a reference encoder that extracts multi-scale features from the source image; and a trainable copy of the SVD encoder, which is initialized with the SVD weights and performs the final spatio-temporal feature merging for generation guidance.
  • Figure 4: Trajectory-based Animation. Image animation results from different trajectories. Results below the dashes are the fine-grained results using motion brushes. Intermediate optical flow results are also visualized.
  • Figure 5: Facial Landmark-based Animation. We produce facial landmarks from the driving audio using SadTalker [zhang2023sadtalker]. The portrait animation is then produced by the facial MOFA-Adapter. Intermediate optical flow results are also visualized.
  • ...and 11 more figures
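
The Figure 2 caption mentions that training hints are produced by sparse motion sampling. A minimal sketch of one way this could look is given below: a few pixels are kept from a dense flow field, and a binary mask marks where hints exist. The uniform random sampling, the `num_points` and `seed` parameters, and the function name `sample_sparse_hints` are all assumptions for illustration; the sampling strategy actually used in the paper is not described on this page.

```python
import random
from typing import List, Tuple

Grid = List[List[float]]
Mask = List[List[int]]


def sample_sparse_hints(
    flow_x: Grid, flow_y: Grid, num_points: int, seed: int = 0
) -> Tuple[Grid, Grid, Mask]:
    """Keep the dense flow at `num_points` random pixels and zero it elsewhere.

    Returns (sparse_x, sparse_y, mask), where mask[y][x] == 1 marks a kept hint.
    Uniform random sampling is an assumption made for this sketch.
    """
    h, w = len(flow_x), len(flow_x[0])
    rng = random.Random(seed)  # local generator, so results are reproducible
    coords = [(y, x) for y in range(h) for x in range(w)]
    chosen = rng.sample(coords, min(num_points, len(coords)))

    sparse_x = [[0.0] * w for _ in range(h)]
    sparse_y = [[0.0] * w for _ in range(h)]
    mask = [[0] * w for _ in range(h)]
    for y, x in chosen:
        sparse_x[y][x] = flow_x[y][x]
        sparse_y[y][x] = flow_y[y][x]
        mask[y][x] = 1
    return sparse_x, sparse_y, mask


if __name__ == "__main__":
    # Toy dense flow: every pixel moves 2 px right and 1 px up.
    dense_x = [[2.0] * 6 for _ in range(4)]
    dense_y = [[-1.0] * 6 for _ in range(4)]
    sx, sy, m = sample_sparse_hints(dense_x, dense_y, num_points=3)
    print(sum(map(sum, m)), "hints kept out of", 4 * 6, "pixels")
```

A sparse-to-dense network would then take the sparse flow and the mask (alongside the reference image) as input and be trained to recover dense motion; that network is not modeled here.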