MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation

Shuwei Shi, Biao Gong, Xi Chen, Dandan Zheng, Shuai Tan, Zizheng Yang, Yuyuan Li, Jingwen He, Kecheng Zheng, Jingdong Chen, Ming Yang, Yinqiang Zheng

TL;DR

MotionStone tackles the challenge of motion-aware image-to-video generation by introducing a dedicated motion estimator that disentangles object and camera motion. The estimator is trained with a contrastive, pairwise annotation scheme complemented by regression-based pseudo-labels from motion tracking, and its outputs are injected into a diffusion Transformer through a decoupled conditioning mechanism. This approach yields state-of-the-art results in text-guided motion control and demonstrates strong generalization across domains, while giving users explicit control over object and camera motion intensities. By providing plug-in motion supervision and a decoupled injection strategy, MotionStone offers a scalable path toward more realistic, controllable motion in diffusion-based video synthesis.

Abstract

Image-to-video (I2V) generation is conditioned on a static image, and has recently been enhanced by using motion intensity as an additional control signal. These motion-aware models are appealing because they can generate diverse motion patterns, yet a reliable motion estimator for training such models on large-scale, in-the-wild video sets is lacking. Traditional metrics, e.g., SSIM or optical flow, are hard to generalize to arbitrary videos, and it is also very difficult for human annotators to label abstract motion intensity. Furthermore, motion intensity should reflect both local object motion and global camera movement, which has not been studied before. This paper addresses the challenge with a new motion estimator capable of measuring the decoupled motion intensities of objects and cameras in video. We leverage contrastive learning on randomly paired videos, training the estimator to identify the video with greater motion intensity. Such a paradigm is annotation-friendly and easy to scale up to achieve stable performance on motion estimation. We then present a new I2V model, named MotionStone, developed with the decoupled motion estimator. Experimental results demonstrate the stability of the proposed motion estimator and the state-of-the-art performance of MotionStone on I2V generation. These advantages allow the decoupled motion estimator to serve as a general plug-in enhancer for both data processing and video generation training.
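
The pairwise training signal mentioned in the abstract (two randomly paired videos, with a label saying which one moves more) can be illustrated with a small sketch. The abstract does not state the loss, so the logistic (RankNet-style) form below is an assumption, and `pairwise_ranking_loss` and its arguments are hypothetical names chosen for illustration.

```python
import math


def pairwise_ranking_loss(score_a: float, score_b: float, a_has_more_motion: bool) -> float:
    """Logistic ranking loss for one annotated video pair.

    score_a / score_b are the estimator's scalar motion scores for the two
    videos (hypothetical interface). The loss is small when the video labeled
    as having more motion also receives the higher score.
    """
    # Signed margin: positive when the predicted ordering matches the label.
    margin = (score_a - score_b) if a_has_more_motion else (score_b - score_a)
    # softplus(-margin) = log(1 + exp(-margin)), written to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Assumed toy scores: the correctly ordered pair is penalized less.
correct = pairwise_ranking_loss(2.0, 1.0, a_has_more_motion=True)
wrong = pairwise_ranking_loss(1.0, 2.0, a_has_more_motion=True)
```

Annotators only need to answer "which video moves more?", which is the property the abstract describes as annotation-friendly; an absolute intensity label is never required for this term.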

Paper Structure

This paper contains 20 sections, 8 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Samples generated by MotionStone. Our model follows motion instructions accurately (rows 1 and 2) and is controllable, easily adapting to specified object motion intensities (row 3) and camera motion intensities (row 4).
  • Figure 2: Illustration of the motion decoupling. Decoupling these two types of motion helps the diffusion model learn specific motion patterns, thereby improving the dynamics and controllability of the generated video.
  • Figure 3: The framework of MotionStone. The first frame of the video serves as the conditioning image, while object and camera motion intensities (ranging from $1$ to $10$) are predicted by the motion estimator and can be customized by users during inference. At the top, the object and camera motion intensities predicted by the motion estimator are each passed through an MLP to obtain the corresponding embeddings, which are then concatenated along the channel dimension to form the Decoupled Motion Embedding. This embedding is added to the time embedding and injected into the Diffusion Transformer to generate videos.
  • Figure 4: Qualitative comparison with other methods. We compare MotionStone with I2VGEN-XL [zhang2023i2vgen], SVD [blattmann2023stable], AnimateAnything [dai2023animateanything] and CogvideoX [yang2024cogvideox]. MotionStone aligns better with the text and image inputs than the other methods (Example 2 and Example 4). Example 1 highlights its camera control ability, whereas other methods tend to produce static frames. Example 3 showcases the capacity of MotionStone to control object movements, whereas other methods either produce static frames or generate unrealistic scenes that defy physical principles.
  • Figure 5: Illustrations of camera motion intensity guidance. We present two common camera movements: Zoom and Pan. Since camera movement often affects object motion in scenes with moving subjects, we fix the object motion intensity at $\bm{5}$ to isolate and highlight the effect of varying the camera motion intensity. Camera movement becomes more pronounced as the score increases.
  • ...and 4 more figures
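
The conditioning path described in the Figure 3 caption (one MLP per intensity, channel-wise concatenation, addition to the time embedding) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the use of two separate MLPs, the layer sizes, the SiLU activation, feeding the raw scalar intensity to each MLP, and the choice that each MLP outputs half of the time-embedding width are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Randomly initialized two-layer MLP parameters (sizes are assumptions)."""
    return {
        "w1": rng.standard_normal((in_dim, hidden_dim)) * 0.02,
        "b1": np.zeros(hidden_dim),
        "w2": rng.standard_normal((hidden_dim, out_dim)) * 0.02,
        "b2": np.zeros(out_dim),
    }


def mlp(params, x):
    """Linear -> SiLU -> Linear (the activation is an assumed choice)."""
    h = x @ params["w1"] + params["b1"]
    h = h / (1.0 + np.exp(-h))  # SiLU: h * sigmoid(h)
    return h @ params["w2"] + params["b2"]


def decoupled_motion_embedding(obj_intensity, cam_intensity, obj_mlp, cam_mlp):
    """Embed each intensity separately, then concatenate along channels."""
    obj = np.asarray(obj_intensity, dtype=np.float64).reshape(-1, 1)
    cam = np.asarray(cam_intensity, dtype=np.float64).reshape(-1, 1)
    return np.concatenate([mlp(obj_mlp, obj), mlp(cam_mlp, cam)], axis=-1)


rng = np.random.default_rng(0)
time_dim = 64  # assumed width of the time embedding
obj_mlp = init_mlp(rng, 1, 32, time_dim // 2)
cam_mlp = init_mlp(rng, 1, 32, time_dim // 2)

# Batch of two samples: object intensity 3 vs 8, camera intensity fixed at 5.
time_emb = rng.standard_normal((2, time_dim))  # stand-in for the real time embedding
motion_emb = decoupled_motion_embedding([3, 8], [5, 5], obj_mlp, cam_mlp)
cond = time_emb + motion_emb  # conditioning vector passed on to the DiT blocks
```

Because the two intensities occupy disjoint channel ranges in this sketch, changing only the object intensity leaves the camera half of `motion_emb` untouched, which mirrors the decoupled control the captions describe.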