LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer

Changgu Chen, Xiaoyan Yang, Junwei Shu, Changbo Wang, Yang Li

TL;DR

The paper presents LMP, a zero-shot framework enabling motion priors from a reference video to guide diffusion-transformer (DiT) video generation in both text-to-video and image-to-video settings. It combines a Foreground-Background Disentangle Module, a Reweighted Motion Transfer Module, and an Appearance Separation Module to decouple content, inject motion, and suppress reference appearance without any training. The approach extends to real reference videos via a simple noise-addition strategy and is validated on DAVIS with new prompts and metrics, achieving state-of-the-art performance in generation quality, prompt-video consistency, and motion fidelity. By leveraging DiT's unified token attention, LMP provides a plug-and-play solution for fine-grained, motion-aware video synthesis with broad practical applicability.

Abstract

In recent years, large-scale pre-trained diffusion transformer models have made significant progress in video generation. While current DiT models can produce high-definition, high-frame-rate, and highly diverse videos, there is a lack of fine-grained control over the video content. Controlling the motion of subjects in videos using only prompts is challenging, especially when it comes to describing complex movements. Further, existing methods fail to control the motion in image-to-video generation, as the subject in the reference image often differs from the subject in the reference video in terms of initial position, size, and shape. To address this, we propose the Leveraging Motion Prior (LMP) framework for zero-shot video generation. Our framework harnesses the powerful generative capabilities of pre-trained diffusion transformers to enable motion in the generated videos to reference user-provided motion videos in both text-to-video and image-to-video generation. To this end, we first introduce a foreground-background disentangle module to distinguish between moving subjects and backgrounds in the reference video, preventing interference in the target video generation. A reweighted motion transfer module is designed to allow the target video to reference the motion from the reference video. To avoid interference from the subject in the reference video, we propose an appearance separation module to suppress the appearance of the reference subject in the target video. We annotate the DAVIS dataset with detailed prompts for our experiments and design evaluation metrics to validate the effectiveness of our method. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in generation quality, prompt-video consistency, and control capability. Our homepage is available at https://vpx-ecnu.github.io/LMP-Website/

Paper Structure

This paper contains 21 sections, 10 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: The core idea of our FBDM. We achieve disentanglement of the foreground and background by utilizing text-video attention maps and video-text attention maps.
  • Figure 2: The pipeline of our LMP framework in each MM-DiT block. For the first $T_1$ denoising steps, we use FBDM and RMTM to enable the target video to reference the motion from the reference video. For denoising steps $T_2$ to $T_3$, we employ ASM to suppress the subject appearance information of the reference video in the target video.
  • Figure 3: The text-to-video results of our LMP framework. The original videos are available in the supplementary material.
  • Figure 4: The image-to-video results of our LMP framework. The original videos are available in the supplementary material.
  • Figure 5: Quality comparison of different methods in the text-to-video setting.
  • ...and 3 more figures
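
The step schedule described in the Figure 2 caption can be illustrated with a small sketch. It only shows which modules would be active at a given denoising step: FBDM and RMTM during the first $T_1$ steps, and ASM from step $T_2$ to step $T_3$. The function names, the 0-based step indexing, the inclusive treatment of the $T_2$ to $T_3$ range, and the numeric values in the usage example are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import List


def active_modules(step: int, t1: int, t2: int, t3: int) -> List[str]:
    """Return the LMP modules assumed to be active at a denoising step.

    Assumptions (not from the paper): steps are 0-indexed, "the first T_1
    steps" means step < t1, and "steps T_2 to T_3" is inclusive on both ends.
    """
    modules: List[str] = []
    if step < t1:
        # Motion phase: disentangle foreground/background, then transfer motion.
        modules += ["FBDM", "RMTM"]
    if t2 <= step <= t3:
        # Appearance phase: suppress the reference subject's appearance.
        modules.append("ASM")
    return modules


def build_schedule(num_steps: int, t1: int, t2: int, t3: int) -> List[List[str]]:
    """List the active modules for every denoising step."""
    return [active_modules(s, t1, t2, t3) for s in range(num_steps)]


if __name__ == "__main__":
    # Placeholder values chosen only to exercise the function.
    for step, mods in enumerate(build_schedule(num_steps=8, t1=3, t2=4, t3=5)):
        print(step, mods if mods else "plain denoising")
```

Steps that fall in neither range run the pre-trained model unmodified in this sketch, which matches the training-free, plug-and-play framing of the abstract.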