Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning

Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, Qifeng Chen

TL;DR

The paper addresses the challenge of transferring complex motion in video diffusion transformers without suffering from motion inconsistency or prohibitive training cost. It introduces Follow-Your-Motion, a two-stage framework that decouples spatial appearance and temporal motion learning through attention-head classification and spatial-temporal LoRA, complemented by sparse motion sampling and adaptive RoPE to speed up tuning and improve motion interpolation. A motion-focused loss further enforces temporal consistency, and a new MotionBench benchmark provides a rigorous, diverse evaluation suite. Empirical results show state-of-the-art performance across varied motion scenarios, confirming both effectiveness and efficiency of the proposed approach.

Abstract

Recent breakthroughs in video diffusion transformers have shown remarkable capabilities in diverse motion generation. For the motion-transfer task, current methods mainly rely on two-stage Low-Rank Adaptation (LoRA) finetuning to obtain better performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between generated and input videos because of the inherent spatial-temporal coupling in the 3D attention operator. It also requires time-consuming finetuning in both stages. To tackle these issues, we propose Follow-Your-Motion, an efficient two-stage video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex motion. Specifically, we propose a spatial-temporal decoupled LoRA that decouples the attention architecture into spatial appearance processing and temporal motion processing. For the second training stage, we design sparse motion sampling and adaptive RoPE to accelerate tuning. To address the lack of a benchmark in this field, we introduce MotionBench, a comprehensive benchmark covering diverse motion, including creative camera motion, single-object motion, multiple-object motion, and complex human motion. Extensive evaluations on MotionBench verify the superiority of Follow-Your-Motion.
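As a rough illustration of the second-stage idea, the sketch below subsamples frames and computes rotary position angles from the original frame indices rather than from the compacted ones. This is only one plausible reading of "sparse motion sampling" and "adaptive RoPE", not the authors' implementation: the fixed stride, the 1D temporal RoPE, and all function names (`sparse_frame_indices`, `rope_angles`, `apply_rope`) are assumptions made for this example.

```python
import numpy as np

def sparse_frame_indices(num_frames: int, stride: int) -> np.ndarray:
    """Keep every `stride`-th frame, starting at frame 0 (assumed sampling rule)."""
    return np.arange(0, num_frames, stride)

def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard 1D RoPE angles, shape [len(positions), dim // 2]."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return positions[:, None] * inv_freq[None, :]

def apply_rope(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Rotate consecutive feature pairs of x (shape [n, dim]) by the given angles."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Hypothetical usage: 16 frames, keep every 4th, 8-dim temporal features.
kept = sparse_frame_indices(16, 4)           # original indices of the kept frames
features = np.ones((len(kept), 8))           # placeholder per-frame features
rotated = apply_rope(features, rope_angles(kept, 8))
```

Passing the original indices means the kept frames retain their true temporal spacing even though fewer frames are processed, which is the property this sketch is meant to show.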

Paper Structure

This paper contains 11 sections, 3 equations, 9 figures, 2 tables, 2 algorithms.

Figures (9)

  • Figure 1: Showcases of our Follow-Your-Motion. Given an input video, Follow-Your-Motion generates a video with the same motion, including the motion of single or multiple objects, complex human poses, and camera movements.
  • Figure 2: Comparison between Follow-Your-Motion and the baseline. We finetune both the baseline and our method for 3,000 steps using Wan2.1 [wan2025]. Our method achieves better reconstruction and motion preservation.
  • Figure 2: Quantitative ablation. Red and blue denote the best and second-best results.
  • Figure 3: Overview of our method. Stage 1: We first classify the attention heads using a pseudo spatial attention map. Stage 2: After attention classification, we first tune the spatial LoRA using a random frame in the video. Stage 3: After finishing spatial LoRA tuning, we load the spatial LoRA weight and conduct temporal tuning using sparse motion sampling and adaptive RoPE.
  • Figure 4: Illustration of sparse motion sampling and adaptive RoPE. The adaptive RoPE is utilized to represent frame position in the video.
  • ...and 4 more figures
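
The Figure 3 caption describes classifying attention heads and then tuning separate spatial and temporal LoRAs. The sketch below shows one generic way to restrict a LoRA update to a chosen subset of heads. It is an illustration only: the row-wise head layout, the masking scheme, and the names `lora_delta` and `head_masked_delta` are assumptions, and the paper's actual classification rule and LoRA placement are not reproduced here.

```python
import numpy as np

def lora_delta(A: np.ndarray, B: np.ndarray, alpha: float) -> np.ndarray:
    """Standard LoRA weight update (alpha / rank) * B @ A, shape [d_out, d_in]."""
    rank = A.shape[0]
    return (alpha / rank) * (B @ A)

def head_masked_delta(delta: np.ndarray, head_ids, num_heads: int) -> np.ndarray:
    """Zero the update for every output row outside the selected heads.

    Assumes output rows are grouped contiguously by head, as in a typical
    multi-head projection matrix.
    """
    d_out = delta.shape[0]
    head_dim = d_out // num_heads
    mask = np.zeros(d_out, dtype=bool)
    for h in head_ids:
        mask[h * head_dim:(h + 1) * head_dim] = True
    return delta * mask[:, None]

# Hypothetical usage: 4 heads of width 2; heads 1 and 3 are treated as "temporal".
A = np.ones((2, 8))   # rank-2 down-projection
B = np.ones((8, 2))   # rank-2 up-projection
temporal_delta = head_masked_delta(lora_delta(A, B, alpha=4.0), [1, 3], num_heads=4)
```

Under this layout, the frozen weight plus `temporal_delta` changes only the rows belonging to the selected heads, so the remaining heads keep their pretrained behavior.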