EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models
Yufei Cai, Hu Han, Yuxiang Wei, Shiguang Shan, Xilin Chen
TL;DR
EfficientMT addresses the limited motion controllability of text-to-video (T2V) diffusion models by adapting a pretrained T2V model end to end with a small set of synthetic motion-transfer samples. It reuses the T2V backbone to extract temporal motion cues from a reference video, introduces a scaler module to distill the motion-related information, and applies a temporal integration mechanism to inject the reference motion features into the generation process, enabling zero-shot motion transfer without test-time optimization. The method is trained on carefully constructed synthetic paired data and captures motion patterns quickly while keeping motion control flexible, outperforming several baselines in efficiency and delivering competitive motion fidelity and temporal consistency. The work thus offers a practical, scalable approach to precise motion control in T2V generation with broad applicability across models and scenes.
Abstract
Progress in generative models has led to significant advances in text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer methods explore the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on sample-specific optimization strategies, resulting in a high computational burden. In this paper, we propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be available at https://github.com/PrototypeNx/EfficientMT.
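To make the three components named in the abstract more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that the scaler can be modeled as a per-frame sigmoid gate on reference features and that temporal integration can be modeled as temporal attention whose keys and values are extended with the gated reference frames. The function names (`scale_reference`, `temporal_attention`), the gate parameterization, and the toy features are all hypothetical; the abstract does not specify these details.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scale_reference(ref, gate_weights):
    """Hypothetical 'scaler': weight each reference frame by a gate in (0, 1).

    ref: list of frames, each a list of feature values.
    gate_weights: assumed learned weights producing one gate score per frame.
    """
    scaled = []
    for frame in ref:
        score = sum(w * v for w, v in zip(gate_weights, frame))
        gate = sigmoid(score)
        scaled.append([gate * v for v in frame])
    return scaled


def temporal_attention(gen, ref_scaled):
    """Assumed 'temporal integration': each generated frame attends over
    the generated frames plus the gated reference frames along time."""
    keys = gen + ref_scaled  # concatenate along the frame axis
    dim = len(gen[0])
    out = []
    for query in gen:
        logits = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(logits)
        out.append(
            [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
        )
    return out


# Toy features at a single spatial location: 2 frames, 2 channels.
gen = [[1.0, 0.0], [0.0, 1.0]]
ref = [[0.5, 0.5], [1.0, -1.0]]
ref_scaled = scale_reference(ref, gate_weights=[0.0, 0.0])  # gate = 0.5
fused = temporal_attention(gen, ref_scaled)
```

In this sketch the reference video influences generation only through the extra keys and values, so no per-sample optimization is involved at inference time, which is the property the abstract emphasizes. How the actual model extracts, scales, and injects reference features may differ.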
