MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model
Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, Yinqiang Zheng
TL;DR
MOFA-Video introduces domain-aware Generative Motion Field Adapters (MOFA-Adapters) that convert sparse control signals (trajectories, facial landmarks, etc.) into dense motion fields to steer video generation from a single image within a frozen Stable Video Diffusion model. The MOFA-Adapter architecture, consisting of a Sparse-to-Dense Motion Generator, a Reference Encoder, and a Fusion Encoder, warps multi-scale image features and injects them into the diffusion backbone, enabling cross-domain, zero-shot combination of controls. Each adapter is trained for its own domain on top of Stable Video Diffusion, using sparse hints from optical flow or landmarks and a distillation-style loss to align downstream latent representations. Inference supports varied signals (trajectories, facial keypoints, motion brushes) and longer videos via periodic sampling, and multiple MOFA-Adapters can be fused for complex, multi-modal animation. Overall, MOFA-Video offers a flexible, unified framework for controllable, open-world image animation with improved temporal consistency and user control.
Abstract
We present MOFA-Video, an advanced controllable image animation method that generates video from a given image using various additional control signals (such as human landmark references, manual trajectories, or even another provided video) or their combinations. This differs from previous methods, which can only work on a specific motion domain or show weak control abilities with a diffusion prior. To achieve our goal, we design several domain-aware motion field adapters (i.e., MOFA-Adapters) to control the generated motions in the video generation pipeline. For MOFA-Adapters, we consider the temporal motion consistency of the video and first generate dense motion flow from the given sparse control conditions; the multi-scale features of the given image are then warped to serve as guidance features for Stable Video Diffusion generation. We naively train two motion adapters, one for manual trajectories and one for human landmarks, since both contain sparse information about the control. After training, the MOFA-Adapters in different domains can also work together for more controllable video generation. Project Page: https://myniuuu.github.io/MOFA_Video/
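To make the warping step concrete, the following is a minimal sketch of backward-warping multi-scale feature maps with a single dense flow field. It is not the authors' implementation: the function names (`warp_features`, `resize_flow`, `warp_multiscale`), the `(C, H, W)` and `(2, H, W)` array layouts, and the nearest-neighbour sampling are all assumptions chosen for brevity, and a real pipeline would more likely use differentiable bilinear sampling on learned features.

```python
import numpy as np

def warp_features(feat, flow):
    """Backward-warp feat (C, H, W) with a dense flow (2, H, W) given in pixels.

    Assumed convention: flow[0] is the horizontal (x) displacement and
    flow[1] the vertical (y) one. Output pixel (y, x) samples feat at
    (y + flow[1], x + flow[0]), rounded to the nearest pixel and clamped
    to the image border.
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return feat[:, src_y, src_x]

def resize_flow(flow, h, w):
    """Nearest-neighbour resize of a flow field to (h, w).

    Displacements are rescaled so they stay expressed in pixels of the
    target resolution.
    """
    _, src_h, src_w = flow.shape
    yi = np.arange(h) * src_h // h
    xi = np.arange(w) * src_w // w
    out = flow[:, yi][:, :, xi].astype(float)
    out[0] *= w / src_w
    out[1] *= h / src_h
    return out

def warp_multiscale(feats, flow):
    """Warp every feature map in feats with the same flow, resized per scale."""
    return [
        warp_features(f, resize_flow(flow, f.shape[1], f.shape[2]))
        for f in feats
    ]

if __name__ == "__main__":
    # Toy pyramid: two scales of random features standing in for encoder outputs.
    rng = np.random.default_rng(0)
    feats = [rng.normal(size=(8, 16, 16)), rng.normal(size=(16, 8, 8))]
    flow = np.zeros((2, 16, 16))
    flow[0] = 2.0  # every output pixel samples 2 px to its right at full resolution
    warped = warp_multiscale(feats, flow)
    print([w.shape for w in warped])
```

In this sketch the same flow is reused at every scale, with its displacements rescaled to the resolution of each feature map, so a 2-pixel shift at full resolution becomes a 1-pixel shift at half resolution.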
