MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models
Tuna Han Salih Meral, Hidir Yesiltepe, Connor Dunlop, Pinar Yanardag
TL;DR
MotionFlow presents a training-free approach to video motion transfer by leveraging cross-attention maps from pre-trained video diffusion models. The method performs DDIM inversion to extract motion-rich attention and then refines a new video through attention-guided latent updates aligned with a target prompt. It achieves a favorable balance between motion fidelity and prompt adherence, effectively handling drastic scene changes and cross-category transfers without fine-tuning. Empirical results on DAVIS show superior performance against baselines across qualitative and quantitative metrics, reinforced by a user study confirming perceptual gains. Public code release supports replication and broad application in video editing and animation workflows.
Abstract
Text-to-video models have demonstrated impressive capabilities in producing diverse and captivating video content, marking a notable advance in generative AI. However, these models generally lack fine-grained control over motion patterns, which limits their practical applicability. We introduce MotionFlow, a novel framework designed for motion transfer in video diffusion models. Our method uses cross-attention maps to accurately capture and manipulate spatial and temporal dynamics, enabling seamless motion transfer across various contexts. Our approach requires no training and operates at test time by leveraging the inherent capabilities of pre-trained video diffusion models. In contrast to traditional approaches, which struggle to maintain consistent motion under comprehensive scene changes, MotionFlow handles such complex transformations through its attention-based mechanism. Our qualitative and quantitative experiments demonstrate that MotionFlow significantly outperforms existing models in both fidelity and versatility, even under drastic scene alterations.
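
The attention-guided latent update mentioned above can be illustrated with a toy sketch. This is not the authors' implementation: a single softmax cross-attention layer stands in for the video diffusion model, the squared-error loss between attention maps is an assumption (the abstract does not specify the objective), and every name here (`cross_attention`, `guidance_loss_and_grad`, `refine_latent`, `ref_map`, `step_size`) is hypothetical. The idea shown is only the general pattern: take a reference attention map for one prompt token, then nudge a new latent by gradient descent until its own attention map for that token moves toward the reference.

```python
import numpy as np


def cross_attention(latent, keys):
    """Softmax over text tokens at each spatial position: (N, d), (T, d) -> (N, T)."""
    logits = latent @ keys.T / np.sqrt(latent.shape[1])
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def guidance_loss_and_grad(latent, keys, token, ref_map):
    """Assumed loss: squared error between the token's attention map and ref_map."""
    attn = cross_attention(latent, keys)
    diff = attn[:, token] - ref_map                      # (N,)
    loss = float(np.sum(diff ** 2))
    # Softmax derivative: d attn[n, token] / d logits[n, j]
    #   = attn[n, token] * (1[j == token] - attn[n, j])
    onehot = np.zeros(keys.shape[0])
    onehot[token] = 1.0
    dattn_dlogits = attn[:, [token]] * (onehot - attn)   # (N, T)
    dloss_dlogits = 2.0 * diff[:, None] * dattn_dlogits  # (N, T)
    grad = dloss_dlogits @ keys / np.sqrt(latent.shape[1])  # (N, d)
    return loss, grad


def refine_latent(latent, keys, token, ref_map, step_size=0.2, steps=50):
    """Plain gradient descent on the latent; returns the latent and the loss history."""
    latent = latent.copy()
    losses = []
    for _ in range(steps):
        loss, grad = guidance_loss_and_grad(latent, keys, token, ref_map)
        losses.append(loss)
        latent -= step_size * grad
    final_loss, _ = guidance_loss_and_grad(latent, keys, token, ref_map)
    losses.append(final_loss)
    return latent, losses


rng = np.random.default_rng(0)
N, T, d = 16, 4, 8                       # spatial positions, prompt tokens, feature dim
keys = rng.normal(size=(T, d))           # toy text-token keys
source_latent = rng.normal(size=(N, d))  # stand-in for a latent recovered by inversion
ref_map = cross_attention(source_latent, keys)[:, 1]  # reference map for token 1

new_latent = rng.normal(size=(N, d))     # latent of the video being generated
refined, losses = refine_latent(new_latent, keys, token=1, ref_map=ref_map)
print(f"loss before: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

In a real pipeline the gradient would presumably be obtained by automatic differentiation through the denoising network rather than derived by hand, and the update would be interleaved with the sampler's denoising steps; both of those details lie outside what the abstract states.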
