MotionPro: A Precise Motion Controller for Image-to-Video Generation

Zhongwei Zhang, Fuchen Long, Zhaofan Qiu, Yingwei Pan, Wu Liu, Ting Yao, Tao Mei

TL;DR

This paper tackles controllable motion in image-to-video diffusion by overcoming coarse motion and motion-category ambiguity inherent to Gaussian-extended trajectories. It introduces MotionPro, which jointly uses region-wise trajectories sampled from local optical-flow regions and a motion mask derived from flow maps, enabling precise fine-grained motion and robust object-versus-camera motion understanding. The method builds on Stable Video Diffusion with a motion encoder that modulates video latents through adaptive feature modulation, enhanced by LoRA in all attention modules. Evaluations on WebVid-10M and the newly curated MC-Bench demonstrate state-of-the-art performance in both fine-grained and object-level motion control, with improved trajectory alignment and richer motion dynamics. The MC-Bench benchmark further provides a standardized, annotated dataset for evaluating controllable I2V motion, reinforcing the practical impact of region-wise motion conditioning for interactive video generation.

Abstract

Animating images with interactive motion control has become popular in image-to-video (I2V) generation. Modern approaches typically rely on large Gaussian kernels to extend motion trajectories into a conditioning signal without explicitly defining the movement region, which leads to coarse motion control and fails to disentangle object motion from camera motion. To alleviate these issues, we present MotionPro, a precise motion controller that leverages region-wise trajectories and a motion mask to regulate fine-grained motion synthesis and to identify the target motion category (i.e., object or camera motion), respectively. Technically, MotionPro first estimates the flow maps of each training video with a tracking model, and then samples region-wise trajectories to simulate the inference scenario. Instead of extending flow through large Gaussian kernels, the region-wise trajectory approach enables more precise control by directly using the trajectories within local regions, thereby characterizing fine-grained movements. A motion mask is simultaneously derived from the predicted flow maps to capture the holistic motion dynamics of the movement regions. To pursue natural motion control, MotionPro further strengthens video denoising by incorporating both the region-wise trajectories and the motion mask through feature modulation. In addition, we construct a benchmark, MC-Bench, with 1.1K user-annotated image-trajectory pairs, for the evaluation of both fine-grained and object-level I2V motion control. Extensive experiments conducted on WebVid-10M and MC-Bench demonstrate the effectiveness of MotionPro. Please refer to our project page for more results: https://zhw-zhang.github.io/MotionPro-page/.
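The two control signals described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the flow layout `(T, H, W, 2)`, the block size `region`, the sampling probability `keep_ratio`, the magnitude `threshold`, and both function names are assumptions chosen only for illustration. The sketch keeps dense flow inside randomly selected local blocks (a region-wise trajectory) and thresholds the time-averaged flow magnitude (a motion mask).

```python
import numpy as np


def region_wise_trajectory(flow, region=8, keep_ratio=0.1, rng=None):
    """Keep dense flow only inside randomly chosen region x region blocks.

    flow: array of shape (T, H, W, 2). All parameters are illustrative
    assumptions, not values taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, height, width, _ = flow.shape
    if height % region or width % region:
        raise ValueError("height and width must be multiples of `region`")
    # One keep/drop decision per block, shared by every frame.
    keep_blocks = rng.random((height // region, width // region)) < keep_ratio
    # Expand the block decisions to pixel resolution.
    keep_pixels = np.repeat(np.repeat(keep_blocks, region, axis=0), region, axis=1)
    # Zero the flow everywhere outside the kept blocks.
    sparse_flow = flow * keep_pixels[None, :, :, None]
    return sparse_flow, keep_pixels


def motion_mask(flow, threshold=1.0):
    """Mark pixels whose time-averaged flow magnitude exceeds `threshold`."""
    magnitude = np.linalg.norm(flow, axis=-1)  # (T, H, W)
    return magnitude.mean(axis=0) > threshold  # (H, W) boolean mask


if __name__ == "__main__":
    demo_flow = np.zeros((4, 16, 16, 2))
    demo_flow[:, :8, :8, :] = 1.0  # only the top-left block moves
    sparse, kept = region_wise_trajectory(demo_flow, keep_ratio=0.5)
    print(sparse.shape, kept.shape, motion_mask(demo_flow, threshold=0.5).sum())
```

Sharing one keep/drop decision per block across all frames keeps each retained region a complete local trajectory rather than isolated per-frame samples, which is the contrast with Gaussian-extended sparse points that the abstract draws.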

Paper Structure

This paper contains 21 sections, 9 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: An illustration of (a) fine-grained and (b) object-level motion control using a typical Gaussian-filtered trajectory and our MotionPro. The flow of the generated videos is also visualized.
  • Figure 2: An overview of (a) our MotionPro for controllable I2V generation and (b) the pipeline of motion condition generation. During training, MotionPro first extracts the proposed region-wise trajectory and motion mask from the input video as the control signals. A motion encoder then learns multi-scale features from these signals, which are injected into the 3D-UNet of SVD through feature modulation. Meanwhile, LoRA layers are integrated into all attention modules in the transformer blocks to improve the optimization of motion-trajectory alignment. In the inference stage, the region-wise trajectory and motion mask are first derived from the user-provided trajectory and brushed region, and then used as guidance to calibrate I2V generation.
  • Figure 3: An illustration of adaptive feature modulation.
  • Figure 4: Examples of fine-grained motion control results on MC-Bench. The input control signals include the reference image, trajectory and motion mask. Best viewed with Acrobat Reader for the animated videos.
  • Figure 5: Examples of object-level motion control results on MC-Bench. The input control signals include the reference image, trajectory and motion mask. MotionPro can successfully handle complicated (e.g., the round trip of the sun in the 1st case) and counterintuitive (e.g., the train moving backward in the 3rd case) motion-trajectory alignment. Best viewed with Acrobat Reader.
  • ...and 7 more figures
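
The captions of Figures 2 and 3 mention two mechanisms: injecting motion features into the denoiser through feature modulation, and adding LoRA layers to the attention modules. The NumPy sketch below shows one common form of each and should be read as an assumption, not as the paper's exact design. `modulate` is a FiLM-style scale-and-shift of normalized latents, and `lora_linear` is the standard low-rank update of a frozen linear layer. All names, shapes, and the `(1 + gamma)` form are hypothetical.

```python
import numpy as np


def modulate(latent, motion_feat, w_gamma, w_beta, eps=1e-5):
    """FiLM-style modulation (an assumed form, not confirmed by the source).

    latent:      (C, H, W) video latent features at one scale
    motion_feat: (D, H, W) motion-encoder features at the same scale
    w_gamma, w_beta: (C, D) weights acting like 1x1 convolutions
    """
    # Normalize each channel of the latent over its spatial positions.
    mean = latent.mean(axis=(1, 2), keepdims=True)
    std = latent.std(axis=(1, 2), keepdims=True)
    normed = (latent - mean) / (std + eps)
    # Predict a per-position scale and shift from the motion features.
    gamma = np.einsum("cd,dhw->chw", w_gamma, motion_feat)
    beta = np.einsum("cd,dhw->chw", w_beta, motion_feat)
    return normed * (1.0 + gamma) + beta


def lora_linear(x, w, a, b, alpha=1.0):
    """Frozen linear layer plus a trainable low-rank update.

    x: (N, d_in); w: (d_out, d_in) frozen; a: (r, d_in); b: (d_out, r).
    """
    rank = a.shape[0]
    return x @ w.T + (alpha / rank) * (x @ a.T) @ b.T
```

With `w_gamma` and `w_beta` at zero, `modulate` reduces to plain normalization, and with `b` at zero, `lora_linear` reduces to the frozen layer, so both additions can start as no-ops relative to the pretrained backbone.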