MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching

Yen-Siang Wu, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang

TL;DR

MotionMatcher tackles motion customization for Text-to-Video diffusion by moving from pixel-level losses to high-level motion feature matching. It uses a frozen pre-trained T2V diffusion model as a motion feature extractor and leverages cross-attention maps for camera framing and temporal self-attention maps for object dynamics, combining them into a motion feature loss $\mathcal{L}_{\rm mot}$ to fine-tune the base model via LoRA. The approach achieves state-of-the-art motion transfer while preserving the model’s prior knowledge, outperforming baselines in text alignment, frame consistency, and motion fidelity, as evidenced by quantitative metrics and human studies. This method improves joint controllability of text and motion for scene staging in AI-generated videos, offering memory-efficient, controllable video synthesis with potential CGI applications.

Abstract

Text-to-video (T2V) diffusion models have shown promising capabilities in synthesizing realistic videos from input text prompts. However, the input text description alone provides limited control over precise object movements and camera framing. In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance. While most existing methods fine-tune pre-trained diffusion models to reconstruct the frame differences of the reference video, we observe that such strategies suffer from content leakage from the reference video and cannot capture complex motion accurately. To address this issue, we propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level. Instead of using pixel-level objectives, MotionMatcher compares high-level, spatio-temporal motion features to fine-tune diffusion models, ensuring precise motion learning. For the sake of memory efficiency and accessibility, we utilize a pre-trained T2V diffusion model, which contains considerable prior knowledge about video motion, to compute these motion features. In our experiments, we demonstrate state-of-the-art motion customization performance, validating the design of our framework.
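
To make the contrast between pixel-level and feature-level objectives concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: `toy_motion_features` is a hypothetical stand-in extractor (frame-to-frame shift of an intensity centroid), whereas MotionMatcher extracts features with a frozen pre-trained T2V diffusion model. The names `pixel_loss` and `feature_matching_loss` are likewise assumptions made for illustration.

```python
from typing import Callable, List

Video = List[List[float]]  # toy video: one flat list of values per frame


def pixel_loss(pred: Video, target: Video) -> float:
    """Mean squared error over raw frame values (a pixel-level objective)."""
    diffs = [(p - t) ** 2 for fp, ft in zip(pred, target) for p, t in zip(fp, ft)]
    return sum(diffs) / len(diffs)


def toy_motion_features(video: Video) -> List[float]:
    """Hypothetical extractor: shift of the intensity centroid between frames."""
    centroids = []
    for frame in video:
        total = sum(frame)
        weighted = sum(i * v for i, v in enumerate(frame))
        centroids.append(weighted / total if total else 0.0)
    return [b - a for a, b in zip(centroids, centroids[1:])]


def feature_matching_loss(
    pred: Video,
    target: Video,
    extract: Callable[[Video], List[float]] = toy_motion_features,
) -> float:
    """Mean squared error between motion features of the two videos."""
    f_pred, f_tgt = extract(pred), extract(target)
    return sum((a - b) ** 2 for a, b in zip(f_pred, f_tgt)) / len(f_pred)


# A blob moving one position to the right per frame.
reference = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]]
# Different appearance (dimmer), identical motion.
dimmer_same_motion = [[0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 0.5, 0]]
# Identical appearance, no motion.
same_look_static = [[1.0, 0, 0, 0]] * 3

print(pixel_loss(dimmer_same_motion, reference))             # 0.0625
print(feature_matching_loss(dimmer_same_motion, reference))  # 0.0
print(feature_matching_loss(same_look_static, reference))    # 1.0
```

In this toy setting the pixel-level loss penalizes an appearance change even though the motion is reproduced exactly, while the feature-level loss is zero for matching motion and nonzero only when the motion differs. That is the intuition behind matching motion features rather than pixels, under the stated simplifications.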

Paper Structure

This paper contains 40 sections, 16 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: MotionMatcher can customize pre-trained T2V diffusion models with a user-provided reference video (top row). Once customized, the diffusion model is able to transfer the precise motion (including object movements and camera framing) in the reference video to a variety of scenes (middle and bottom rows).
  • Figure 2: Overview of MotionMatcher. (a) We fine-tune the pre-trained T2V diffusion model (T2V-DM) using the motion feature matching objective. Unlike the standard pixel-level DDPM loss, we align the motion features of the predicted noisy video $v^{\theta}_t$ with those of the ground truth noisy video $\hat{v_t}$. To extract motion features from noisy latent videos, we use a pre-trained T2V-DM (frozen) as a feature extractor. (b) We leverage the cross-attention (CA) maps and temporal self-attention (TSA) maps in the pre-trained T2V diffusion model to extract motion cues. The final motion features are the combination of the CA maps and TSA maps.
  • Figure 3: Example of cross-attention maps. We visualize the cross-attention map $M_{{\rm CA}}$, computed between the activations in T2V diffusion models and the text prompt $y$. Here we obtain the CA map by adding noise to the video and using the pre-trained diffusion model as a feature extractor. The extracted CA maps reveal the placement and shot sizes of the object associated with the word "car" in each video frame.
  • Figure 4: Example of temporal self-attention maps. We visualize the temporal self-attention map $M_{{\rm TSA}}$, computed between two different frames. Here we obtain the TSA map by adding noise to the video and using the pre-trained diffusion model as a feature extractor. The extracted TSA maps describe the dynamics of the video in detail.
  • Figure 5: Qualitative comparisons. Compared to existing methods such as VMC, MotionDirector, DMT, and MotionClone, our approach demonstrates superior text alignment and video quality, achieving high-fidelity motion transfer from reference videos to new scenes.
  • ...and 8 more figures
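
The cross-attention (CA) and temporal self-attention (TSA) maps described in the captions of Figures 2 to 4 are both instances of scaled dot-product attention. The sketch below shows, under simplifying assumptions, how each kind of map could be formed. The helper `attention_map` and the toy inputs are hypothetical, and raw feature vectors are used directly as queries and keys, whereas a real T2V diffusion model applies learned projections inside its attention layers.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries: List[Vector], keys: List[Vector]) -> List[List[float]]:
    """Row-wise softmax(Q K^T / sqrt(d)); shape is len(queries) x len(keys)."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


# Cross-attention: spatial positions of one frame attend to text tokens.
# Each row says how strongly a position relates to each token (e.g. "car").
frame_positions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy position features
text_tokens = [[2.0, 0.0], [0.0, 2.0]]                  # toy token embeddings
ca_map = attention_map(frame_positions, text_tokens)     # 3 positions x 2 tokens

# Temporal self-attention: one spatial location attends across frames.
# Each row says how strongly a frame relates to every other frame.
location_over_time = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # 3 frames
tsa_map = attention_map(location_over_time, location_over_time)  # 3 x 3

for row in tsa_map:
    print([round(w, 3) for w in row])
```

In this toy example the first two frames carry identical features at the chosen location, so frame 0 attends to frames 0 and 1 equally and more strongly than to frame 2. Stacking such maps over all locations and tokens gives one plausible reading of how CA maps can encode framing and TSA maps can encode dynamics.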