MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models

Tuna Han Salih Meral, Hidir Yesiltepe, Connor Dunlop, Pinar Yanardag

TL;DR

MotionFlow presents a training-free approach to video motion transfer by leveraging cross-attention maps from pre-trained video diffusion models. The method performs DDIM inversion to extract motion-rich attention and then refines a new video through attention-guided latent updates aligned with a target prompt. It achieves a favorable balance between motion fidelity and prompt adherence, effectively handling drastic scene changes and cross-category transfers without fine-tuning. Empirical results on DAVIS show superior performance against baselines across qualitative and quantitative metrics, reinforced by a user study confirming perceptual gains. Public code release supports replication and broad application in video editing and animation workflows.
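The DDIM inversion step mentioned above can be illustrated with a small NumPy sketch. This is not the authors' implementation: `toy_eps_model` is a hypothetical stand-in for the pre-trained video diffusion model's noise prediction, the schedule `abars` is made up, and the latent shape is arbitrary. Only the deterministic DDIM update rule (eta = 0) is standard. Because the toy noise predictor ignores its input, inverting and then sampling recovers the starting latent exactly; with a real network the round trip is only approximate.

```python
import numpy as np


def ddim_step(x, eps, abar_from, abar_to):
    """Deterministic DDIM move (eta = 0) between two cumulative-alpha levels."""
    # Estimate the clean latent implied by the current sample and noise guess.
    x0_pred = (x - np.sqrt(1.0 - abar_from) * eps) / np.sqrt(abar_from)
    # Re-noise that estimate to the destination noise level.
    return np.sqrt(abar_to) * x0_pred + np.sqrt(1.0 - abar_to) * eps


def toy_eps_model(x, abar):
    # Hypothetical placeholder for the diffusion model's noise prediction.
    # It ignores x and abar, which makes the round trip below exact.
    return np.full_like(x, 0.1)


def ddim_invert(x0, abars):
    """Walk a clean latent toward noise; abars must be decreasing."""
    trajectory = [x0]
    x = x0
    for a_from, a_to in zip(abars[:-1], abars[1:]):
        x = ddim_step(x, toy_eps_model(x, a_from), a_from, a_to)
        trajectory.append(x)
    return trajectory


def ddim_sample(x_noisy, abars):
    """Walk back from the noisiest level to the clean one."""
    x = x_noisy
    reversed_abars = abars[::-1]
    for a_from, a_to in zip(reversed_abars[:-1], reversed_abars[1:]):
        x = ddim_step(x, toy_eps_model(x, a_from), a_from, a_to)
    return x


rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 8, 8))      # toy (frames, height, width) latent
abars = np.linspace(0.999, 0.05, 20)     # assumed noise schedule
trajectory = ddim_invert(latent, abars)
recovered = ddim_sample(trajectory[-1], abars)
roundtrip_error = float(np.max(np.abs(recovered - latent)))
```

In an actual pipeline, the cross-attention maps would be recorded from the network during the inversion pass; the toy predictor here has no attention layers, so that part is omitted.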

Abstract

Text-to-video models have demonstrated impressive capabilities in producing diverse and captivating video content, showcasing a notable advancement in generative AI. However, these models generally lack fine-grained control over motion patterns, limiting their practical applicability. We introduce MotionFlow, a novel framework designed for motion transfer in video diffusion models. Our method utilizes cross-attention maps to accurately capture and manipulate spatial and temporal dynamics, enabling seamless motion transfers across various contexts. Our approach does not require training and operates at test time by leveraging the inherent capabilities of pre-trained video diffusion models. In contrast to traditional approaches, which struggle with comprehensive scene changes while maintaining consistent motion, MotionFlow successfully handles such complex transformations through its attention-based mechanism. Our qualitative and quantitative experiments demonstrate that MotionFlow significantly outperforms existing models in both fidelity and versatility even during drastic scene alterations.

Paper Structure

This paper contains 15 sections, 4 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: MotionFlow is a training-free method that leverages attention for motion transfer. Our method can successfully transfer a wide variety of motion types, ranging from simple to complex motion patterns.
  • Figure 2: Motivation. Visualization of cross-attention maps for the subject tokens, showing how MotionFlow captures and transfers motion dynamics from the original video, ensuring accurate subject motion while adhering to new edit prompts.
  • Figure 3: Overview of MotionFlow framework. Our invert-then-generate method operates in two main stages: (1) Inversion, where DDIM inversion is used to extract latent representations and cross-attention maps from the original video, generating target masks that capture the subject's motion and spatial details; (2) Generation, where these masks and a text prompt guide the creation of a new video, aligning with the original video's motion dynamics and spatial layout while adhering to the semantic content of the prompt.
  • Figure 4: Qualitative Results. MotionFlow can successfully transfer a wide variety of motion types, ranging from single to multiple motions and from simple to complex motion patterns. Additionally, it can either maintain the original scene layout or significantly alter it based on the user-provided text prompt. Please refer to the supplementary material where the actual videos are provided.
  • Figure 5: Comparison. Qualitative comparison of our method, MotionFlow, with DMT [yatim2024space], MotionDirector [zhao2025motiondirector], Motion Inversion [wang2024motion], and VMC [jeong2024vmc].
  • ...and 3 more figures
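
The two stages described in the Figure 3 caption (masks derived from cross-attention maps, then mask-guided generation) can be sketched in a toy form. Everything below is an assumption made for illustration: the quantile threshold, the binary cross-entropy objective, the step size, and the use of `sigmoid(latent)` as a stand-in for the attention map the model would produce are not taken from the paper, and the real method would backpropagate through the diffusion network rather than through a closed-form expression.

```python
import numpy as np


def moving_blob(frames=4, size=16):
    """Synthetic 'source' attention: a Gaussian blob sliding to the right."""
    ys, xs = np.mgrid[0:size, 0:size]
    maps = []
    for f in range(frames):
        cx = 3 + 3 * f  # blob centre column for this frame
        maps.append(np.exp(-((xs - cx) ** 2 + (ys - 8) ** 2) / 8.0))
    return np.stack(maps)  # shape (frames, size, size)


def attention_to_mask(attn, quantile=0.9):
    """Binarize each frame by keeping its highest-attention pixels (assumed rule)."""
    thresh = np.quantile(attn, quantile, axis=(1, 2), keepdims=True)
    return (attn >= thresh).astype(np.float64)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(pred, mask, tiny=1e-8):
    """Mean binary cross-entropy between a soft map and a binary mask."""
    pred = np.clip(pred, tiny, 1.0 - tiny)
    return float(-np.mean(mask * np.log(pred) + (1 - mask) * np.log(1 - pred)))


def guided_update(latent, mask, step_size=0.5):
    """One gradient-descent step on the summed BCE of sigmoid(latent) vs. mask."""
    grad = sigmoid(latent) - mask  # analytic gradient of the summed BCE
    return latent - step_size * grad


source_attention = moving_blob()
target_mask = attention_to_mask(source_attention)

latent = np.zeros_like(source_attention)  # toy latent for the new video
losses = [bce(sigmoid(latent), target_mask)]
for _ in range(10):
    latent = guided_update(latent, target_mask)
    losses.append(bce(sigmoid(latent), target_mask))
```

The loss shrinks at every step, so the toy "attention" drifts toward the moving mask frame by frame. In this simplified picture, that is the sense in which a mask taken from the source video can steer where the subject appears in the generated one.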