Video Motion Transfer with Diffusion Transformers

Alexander Pondaven, Aliaksandr Siarohin, Sergey Tulyakov, Philip Torr, Fabio Pizzati

TL;DR

DiTFlow addresses the challenge of transferring motion from a reference video to a target video generated by Diffusion Transformers (DiT) without retraining. It introduces Attention Motion Flow (AMF), derived from cross-frame attention within a DiT, to guide the latent denoising process and reproduce reference motion; it further enables zero-shot motion transfer by optimizing DiT positional embeddings. Across DAVIS-based benchmarks with CogVideoX backbones, DiTFlow consistently outperforms UNet-based baselines and prior diffusion-based motion transfer methods in motion fidelity and perceptual quality, with favorable human judgments. This work advances controllable, high-fidelity video synthesis by leveraging DiT attention mechanisms for explicit, patch-level motion control and flexible zero-shot capabilities.

Abstract

We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We guide the latent denoising process in an optimization-based, training-free manner by optimizing latents with our AMF loss to generate videos that reproduce the motion of the reference. We also apply our optimization strategy to transformer positional embeddings, which boosts zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods and outperform all of them across multiple metrics and in human evaluation.

Paper Structure

This paper contains 33 sections, 8 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of DiTFlow. We propose a motion transfer method tailored for video Diffusion Transformers (DiT). We exploit a training-free strategy to transfer the motion of a reference video (top) to newly synthesized video content with arbitrary prompts (bottom). By optimizing DiT-specific positional embeddings, we can also synthesize new videos in a zero-shot manner.
  • Figure 2: Core idea of DiTFlow. We extract the AMF from a reference video and use it to guide the latent representation $z_t$ towards the motion of the reference video. In our experiments, we also tested optimizing positional embeddings for improved zero-shot performance.
  • Figure 3: Guidance. We compute the reference displacement by processing cross-frame attentions with an argmax operation and rearranging them into displacement maps, identifying patch-aware cross-frame relationships. For video synthesis, we do the same operation with a soft argmax to preserve gradients, and impose reconstruction with the reference displacement.
  • Figure 4: Baseline comparison. Baselines associate motion to wrong elements due to poor layout representation typical of UNet-based approaches that do spatial averaging or only consider deviations at each location. DiTFlow captures the spatio-temporal motion of each patch, resulting in correct spatial positioning and sizing of moving elements, e.g. the dog (left), the bear (middle), the parachute (right).
  • Figure 5: Qualitative results of DiTFlow. We are able to perform motion transfer in various conditions. Note how varying the prompt completely changes the scene's appearance while maintaining consistent motion. We map motion to correct elements even in cases where the motion changes drastically in positioning and size (bottom right).
  • ...and 4 more figures
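
The guidance step described in the Figure 3 caption (argmax for the reference, soft argmax for the generated video, then a reconstruction penalty) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the single frame pair, the `(h, w)` patch grid, the `temperature` parameter, and the squared-error penalty are all assumptions made for illustration. The input `logits` stands in for pre-softmax cross-frame attention scores, with one row per query patch in one frame and one column per key patch in another frame.

```python
import numpy as np

def patch_coords(h, w):
    """Return an (h*w, 2) array of (row, col) coordinates for an h x w patch grid."""
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hard_displacement(logits, h, w):
    """Reference side: each query patch points to its single best-matching key patch.

    logits: (h*w, h*w) cross-frame attention scores (hypothetical input).
    Returns an (h*w, 2) array of (d_row, d_col) displacements.
    """
    coords = patch_coords(h, w)
    best_match = np.argmax(logits, axis=-1)   # not differentiable
    return coords[best_match] - coords

def soft_displacement(logits, h, w, temperature=1.0):
    """Generation side: attention-weighted expected position of the matching patch.

    Replacing argmax with a softmax-weighted average keeps the result
    differentiable with respect to the logits. The temperature is an
    assumed knob: lower values approach the hard argmax.
    """
    coords = patch_coords(h, w)
    weights = softmax(logits / temperature, axis=-1)
    return weights @ coords - coords

def displacement_loss(ref_disp, gen_disp):
    """Mean squared distance between displacement maps (assumed penalty)."""
    return float(np.mean(np.sum((ref_disp - gen_disp) ** 2, axis=-1)))

# Usage: a 2 x 3 patch grid where every patch attends to the next patch index.
h, w = 2, 3
ref_logits = 50.0 * np.roll(np.eye(h * w), 1, axis=1)
ref_disp = hard_displacement(ref_logits, h, w)
gen_disp = soft_displacement(np.zeros((h * w, h * w)), h, w)  # uniform attention
loss = displacement_loss(ref_disp, gen_disp)
```

In a real pipeline the soft displacement would be computed inside an automatic-differentiation framework so that the loss gradient can flow back to the latents or positional embeddings; NumPy is used here only to keep the arithmetic visible.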