Video Motion Transfer with Diffusion Transformers
Alexander Pondaven, Aliaksandr Siarohin, Sergey Tulyakov, Philip Torr, Fabio Pizzati
TL;DR
DiTFlow transfers the motion of a reference video to a new video generated by a Diffusion Transformer (DiT), without retraining the model. It extracts Attention Motion Flow (AMF), a patch-wise motion signal derived from cross-frame attention within the DiT, and uses it to guide latent denoising so that the generated video reproduces the reference motion. Applying the same optimization to the DiT positional embeddings further improves zero-shot motion transfer. Across DAVIS-based benchmarks with CogVideoX backbones, DiTFlow consistently outperforms UNet-based baselines and prior diffusion-based motion transfer methods in motion fidelity and perceptual quality, with favorable human judgments. By using DiT attention for explicit, patch-level motion control, the work advances controllable, high-fidelity video synthesis.
Abstract
We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We then guide the latent denoising process in an optimization-based, training-free manner, optimizing latents with our AMF loss to generate videos that reproduce the motion of the reference. We also apply our optimization strategy to transformer positional embeddings, which boosts zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods and outperform all of them across multiple metrics and in human evaluation.
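To make the idea of a patch-wise motion signal concrete, the following is a minimal, illustrative sketch in plain Python, not the authors' implementation. It assumes that motion between two frames can be read off cross-frame attention by taking, for each patch of one frame, the offset to its most-attended patch in the other frame, and that a loss can compare these reference offsets with attention-weighted (soft) offsets from the generated video. All function names, the toy 2x2 patch grid, and the one-hot features are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_frame_attention(queries, keys):
    # Attention from every patch of frame i (queries) to every patch of frame j (keys).
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(qd * kd for qd, kd in zip(q, k)) for k in keys])
        for q in queries
    ]

def patch_displacements(attn, grid_w):
    # Hard displacement (assumption): offset (dx, dy) to the most-attended key patch.
    flows = []
    for p, row in enumerate(attn):
        best = max(range(len(row)), key=row.__getitem__)
        flows.append((best % grid_w - p % grid_w, best // grid_w - p // grid_w))
    return flows

def soft_displacements(attn, grid_w):
    # Soft displacement (assumption): attention-weighted expected offset per patch.
    flows = []
    for p, row in enumerate(attn):
        px, py = p % grid_w, p // grid_w
        dx = sum(w * (k % grid_w - px) for k, w in enumerate(row))
        dy = sum(w * (k // grid_w - py) for k, w in enumerate(row))
        flows.append((dx, dy))
    return flows

def motion_loss(ref_flow, gen_flow):
    # Mean squared distance between reference and generated displacements.
    total = sum(
        (rx - gx) ** 2 + (ry - gy) ** 2
        for (rx, ry), (gx, gy) in zip(ref_flow, gen_flow)
    )
    return total / len(ref_flow)

def one_hot(i, n=4, scale=4.0):
    # Toy patch feature: a scaled one-hot vector identifying the patch content.
    return [scale if d == i else 0.0 for d in range(n)]

# Toy 2x2 patch grid: in frame j the content of each row is swapped horizontally.
frame_i = [one_hot(0), one_hot(1), one_hot(2), one_hot(3)]
frame_j = [one_hot(1), one_hot(0), one_hot(3), one_hot(2)]

attn = cross_frame_attention(frame_i, frame_j)
ref_flow = patch_displacements(attn, grid_w=2)   # hard offsets from the "reference"
gen_flow = soft_displacements(attn, grid_w=2)    # soft offsets, as if from a generation
loss = motion_loss(ref_flow, gen_flow)
```

In this toy setup the soft offsets nearly coincide with the hard ones, so the loss is close to zero. In a guidance setting, a loss of this kind would be minimized with respect to the latents (or positional embeddings) through automatic differentiation, which this sketch does not attempt.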
