
EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models

Yufei Cai, Hu Han, Yuxiang Wei, Shiguang Shan, Xilin Chen

TL;DR

EfficientMT addresses the difficulty of controlling motion in text-to-video diffusion by end-to-end adapting a pretrained T2V model using a small set of synthetic motion-transfer samples. It reuses the T2V backbone to extract temporal motion cues, introduces a scaler to distill this information, and applies a temporal integration mechanism to inject motion features throughout the generation process, enabling zero-shot motion transfer without test-time optimization. The method is trained on carefully constructed synthetic paired data and demonstrates faster motion-pattern capture while maintaining flexible motion controllability, outperforming several baselines in efficiency and delivering competitive motion fidelity and temporal consistency. The work thus offers a practical, scalable approach to precise motion control in T2V generation with broad applicability across models and scenes.

Abstract

Progress in generative models has led to significant advances in text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer methods explore the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on a sample-specific optimization strategy, resulting in a high computational burden. In this paper, we propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be available at https://github.com/PrototypeNx/EfficientMT.

Paper Structure

This paper contains 16 sections, 7 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Generation results of our EfficientMT. Based on a pretrained T2V model, EfficientMT performs zero-shot transfer of both subject and camera motion at inference time only. Please refer to the supplementary materials for better visual evaluation.
  • Figure 2: Comparison of methods. Our method inherits the strengths of both methods, achieving efficient and flexible motion transfer.
  • Figure 3: Overview of our EfficientMT. (a): We reuse the backbone of the T2V model to extract reference features, which are then injected into the temporal attention layers of the upsampling stage through a temporal integration mechanism. (b): The scaler predicts a fine-grained scale map for the reference features, filtering out irrelevant information. (c): The temporal integration concatenates the features along the temporal axis; the query is projected from the original features, while the key and value are obtained from the integrated features.
  • Figure 4: Visual comparisons on the effect of the integration scale. As the injection scale of reference features increases, the control over the generated content becomes more pronounced. Introducing a scaler enhances the robustness of the generation.
  • Figure 5: Visual comparisons. Our EfficientMT enables the unified transfer of subject and camera motion. Compared to state-of-the-art methods, our method offers superior editing flexibility and motion fidelity. Zoom in for a better view.
  • ...and 3 more figures
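
The temporal integration described in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name, the tensor layout (spatial positions, frames, channels), the single-head attention, the plain projection matrices, and the scale map being passed in as an input rather than predicted by a scaler network are all assumptions made for illustration. It shows only the idea stated in the caption: reference features are scaled, concatenated with the generated features along the temporal axis, and used for the key and value, while the query comes from the original features alone.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_integration_attention(gen, ref, scale, w_q, w_k, w_v):
    """Hypothetical single-head temporal attention with reference injection.

    gen, ref : arrays of shape (positions, frames, channels)
    scale    : scale map broadcastable to ref (assumed given here; the
               paper's scaler module would predict it)
    w_q/k/v  : (channels, channels) projection matrices (assumed)
    """
    # Scale the reference features to suppress irrelevant information.
    scaled_ref = ref * scale
    # Concatenate along the temporal (frame) axis: (P, 2F, C).
    integrated = np.concatenate([gen, scaled_ref], axis=1)
    # Query from the original features; key and value from the integrated ones.
    q = gen @ w_q                    # (P, F, C)
    k = integrated @ w_k             # (P, 2F, C)
    v = integrated @ w_v             # (P, 2F, C)
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (P, F, 2F)
    attn = softmax(logits, axis=-1)
    return attn @ v                  # (P, F, C)


# Toy usage with random tensors (all sizes are arbitrary).
rng = np.random.default_rng(0)
P, F, C = 4, 3, 8
gen = rng.normal(size=(P, F, C))
ref = rng.normal(size=(P, F, C))
weights = [rng.normal(size=(C, C)) for _ in range(3)]
out = temporal_integration_attention(gen, ref, np.ones((P, F, 1)), *weights)
print(out.shape)
```

In this sketch the output keeps the shape of the generated features, so the layer could in principle stand in for an ordinary temporal self-attention; a scale map of zeros makes the result independent of the reference content, which loosely mirrors the effect of a smaller injection scale discussed in the Figure 4 caption.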