MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching
Yen-Siang Wu, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang
TL;DR
MotionMatcher tackles motion customization for text-to-video (T2V) diffusion models by replacing pixel-level losses with high-level motion feature matching. It uses a frozen pre-trained T2V diffusion model as a motion feature extractor, taking cross-attention maps to capture camera framing and temporal self-attention maps to capture object dynamics, and combines them into a motion feature loss $\mathcal{L}_{\rm mot}$ that is used to fine-tune the base model via LoRA. The approach achieves state-of-the-art motion transfer while preserving the model’s prior knowledge, outperforming baselines in text alignment, frame consistency, and motion fidelity, as evidenced by quantitative metrics and human studies. The method improves joint control over text and motion for scene staging in AI-generated videos, offering memory-efficient, controllable video synthesis with potential CGI applications.
Abstract
Text-to-video (T2V) diffusion models have shown promising capabilities in synthesizing realistic videos from input text prompts. However, the input text description alone provides limited control over precise object movements and camera framing. In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance. While most existing methods fine-tune pre-trained diffusion models to reconstruct the frame differences of the reference video, we observe that such strategies suffer from content leakage from the reference video and cannot capture complex motion accurately. To address this issue, we propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level. Instead of using pixel-level objectives, MotionMatcher compares high-level, spatio-temporal motion features to fine-tune diffusion models, ensuring precise motion learning. For the sake of memory efficiency and accessibility, we utilize a pre-trained T2V diffusion model, which contains considerable prior knowledge about video motion, to compute these motion features. In our experiments, we demonstrate state-of-the-art motion customization performance, validating the design of our framework.
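To make the idea of a feature-level objective concrete, the following is a minimal sketch of comparing two sets of motion features rather than pixels. It is not the paper's implementation: the squared-error distance, the weights `w_cross` and `w_temp`, the dictionary keys, the array shapes, and the helper `toy_features` are all assumptions made for illustration. In practice the attention maps would be read from the frozen T2V diffusion model; here they are random arrays normalized to resemble attention distributions.

```python
import numpy as np


def motion_feature_loss(feats_out, feats_ref, w_cross=1.0, w_temp=1.0):
    """Weighted squared-error distance between two sets of motion features.

    Assumption: each argument is a dict with two arrays, "cross_attn"
    (stand-in for cross-attention maps) and "temporal_attn" (stand-in for
    temporal self-attention maps). The distance and weights are illustrative.
    """
    cross_term = np.mean((feats_out["cross_attn"] - feats_ref["cross_attn"]) ** 2)
    temp_term = np.mean((feats_out["temporal_attn"] - feats_ref["temporal_attn"]) ** 2)
    return w_cross * cross_term + w_temp * temp_term


def toy_features(seed, frames=8, tokens=4, positions=16):
    """Hypothetical stand-in for features extracted by a frozen T2V model."""
    rng = np.random.default_rng(seed)
    # One distribution over text tokens per frame and spatial position.
    cross = rng.random((frames, positions, tokens))
    cross /= cross.sum(axis=-1, keepdims=True)
    # One frame-to-frame distribution per spatial position.
    temporal = rng.random((positions, frames, frames))
    temporal /= temporal.sum(axis=-1, keepdims=True)
    return {"cross_attn": cross, "temporal_attn": temporal}


ref = toy_features(seed=0)   # features of the reference video
out = toy_features(seed=1)   # features of the model's current output

loss_same = motion_feature_loss(ref, ref)
loss_diff = motion_feature_loss(out, ref)
print(loss_same, loss_diff)
```

Because the loss depends only on the attention-derived features, two videos with different appearance but identical features would incur zero penalty, which is the intuition behind avoiding content leakage from the reference video.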
