MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance

Hidir Yesiltepe, Tuna Han Salih Meral, Connor Dunlop, Pinar Yanardag

TL;DR

MotionShop presents Mixture of Score Guidance (MSG), a theoretically grounded, training-free framework for zero-shot motion transfer in diffusion-based video models. By decomposing the conditional score into motion and content components and interpreting score mixing as a mixture of potential energies, MSG links motion transfer to stabilized Langevin dynamics and enables faithful transfer of single-object, multi-object, and complex camera motions. The approach operates directly on pre-trained video diffusion models without fine-tuning, and is supported by qualitative and quantitative experiments alongside MotionBench, a new dataset of 200 source videos and 1,000 transferred motions. MotionShop demonstrates superior motion fidelity and temporal consistency while preserving scene content, with a trade-off against text alignment that favors robust motion transfer. The work advances practical motion editing in diffusion-based video generation and provides a standardized benchmark for evaluating future motion-transfer methods.

Abstract

In this work, we propose the first motion transfer approach for diffusion transformers through Mixture of Score Guidance (MSG), a theoretically grounded framework for motion transfer in diffusion models. Our key theoretical contribution lies in reformulating the conditional score so that it decomposes into a motion score and a content score. By formulating motion transfer as a mixture of potential energies, MSG naturally preserves scene composition and enables creative scene transformations while maintaining the integrity of the transferred motion patterns. This sampling procedure operates directly on pre-trained video diffusion models without additional training or fine-tuning. Through extensive experiments, MSG demonstrates successful handling of diverse scenarios, including single-object, multi-object, and cross-object motion transfer as well as complex camera motion transfer. Additionally, we introduce MotionBench, the first motion transfer dataset, consisting of 200 source videos and 1,000 transferred motions and covering single-object transfers, multi-object transfers, and complex camera motions.

Paper Structure

This paper contains 26 sections, 12 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Mixture of Score Guidance (MSG), a novel approach for zero-shot motion transfer in diffusion models, enables high-fidelity motion synthesis across diverse scenarios. MSG successfully handles various motion patterns including complex object movements and camera trajectories. Full video results are available in the supplementary material.
  • Figure 2: Our intuition. Visualization of motion characteristics $\mathcal{M}(z)$ extracted from early-timestep conditional scores. (Left) Multiple object motion representation showing the simultaneous movement of two objects. (Right) Combined object and camera motion representation demonstrating how our method captures both local object motion and global camera movement patterns. The visualizations are obtained from the conditional score maps $\nabla_{z_t} \log p_t(z|y)$ at early timesteps $t \ll T$.
  • Figure 3: Method Overview. Framework of our Mixture of Score Guidance (MSG) for zero-shot motion transfer in diffusion models. Left: Reference motion extraction stage captures motion characteristics $\mathcal{M}(z)$ from early-timestep conditional scores $\nabla_z \log p(z^{(1)}|y^{(1)})$ and $\nabla_z \log p(z^{(2)}|y^{(2)})$. Middle: Motion transfer combines content and motion scores through our MSG formulation $s_{\text{MSG}}(z_t, z_t^*) = \nabla_z \log p_t(z|y) + w_{\text{MSG}}(\nabla_z \log p_t(z^*|y^*) - \nabla_z \log p_t(z))$. Right: MSG path redirection mechanism showing attention-guided dynamics that enable stable motion transfer by exploring the correct motion manifold while preserving content through modified Langevin dynamics governed by our mixture of potential energies $U_{\text{MSG}}(z_t) = U_{\text{content}}(z_t) + w_{\text{MSG}}[U_{\text{motion}}(z_t, z_t^*) - U_{\text{prior}}(z_t)]$.
  • Figure 4: Qualitative results demonstrating our method's ability to preserve motion priors while generating novel content from text prompts. (Left) Single-object motion transfer, where complex motions such as mechanical movements and horseback riding sequences are accurately preserved in the generated outputs. (Right) Multi-object scenarios, where our method successfully maintains the original motion dynamics while generating diverse subjects. Please refer to the Supplementary Material for full videos and additional examples.
  • Figure 5: Qualitative comparison of motion transfer capabilities. We compare MotionShop (bottom row) with existing methods (VMC, DMT, MD, MI) on three challenging scenarios. Left: Single object motion transfer of a robot-driven motorcycle in a desert scene. Middle: Multiple object motion transfer involving miniature medieval knights, demonstrating the ability to preserve interactions between objects. Right: Camera motion transfer capturing the dynamic perspective of a raindrop on a leaf. Our method demonstrates superior motion-text alignment across all three motion transfer categories.
  • ...and 6 more figures
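
The score mixture in the Figure 3 caption can be illustrated with a toy sketch. This is not the authors' implementation: the three scores are replaced by scores of hand-picked 2D unit-variance Gaussians, all evaluated at the same point `z`, whereas the paper obtains them from a pre-trained video diffusion model and evaluates the motion term on a reference latent. The names `gaussian_score`, `msg_score`, `langevin_step`, and the three mean vectors are hypothetical and chosen only for illustration. Under the usual convention that a score is the negative gradient of a potential energy, the energy mixture $U_{\text{MSG}}$ in the caption corresponds term by term to the score combination computed below.

```python
import math
import random

# Hypothetical toy stand-ins for the three distributions (not from the paper).
CONTENT_MEAN = [1.0, 0.0]  # plays the role of p_t(z | y)
MOTION_MEAN = [0.0, 2.0]   # plays the role of p_t(z* | y*)
PRIOR_MEAN = [0.0, 0.0]    # plays the role of the unconditional p_t(z)


def gaussian_score(z, mean, var=1.0):
    """Score (gradient of log density) of an isotropic Gaussian at z."""
    return [-(zi - mi) / var for zi, mi in zip(z, mean)]


def msg_score(content, motion, prior, w_msg):
    """Combine scores as in the caption: content + w_MSG * (motion - prior)."""
    return [c + w_msg * (m - p) for c, m, p in zip(content, motion, prior)]


def toy_msg_score(z, w_msg):
    """Evaluate the mixed score at z using the toy Gaussians above."""
    return msg_score(
        gaussian_score(z, CONTENT_MEAN),
        gaussian_score(z, MOTION_MEAN),
        gaussian_score(z, PRIOR_MEAN),
        w_msg,
    )


def langevin_step(z, score, step_size, rng):
    """One unadjusted Langevin update: drift along the score plus noise."""
    scale = math.sqrt(2.0 * step_size)
    return [zi + step_size * si + scale * rng.gauss(0.0, 1.0)
            for zi, si in zip(z, score)]


if __name__ == "__main__":
    rng = random.Random(0)
    z = [0.0, 0.0]
    for _ in range(1000):
        z = langevin_step(z, toy_msg_score(z, w_msg=0.5), 0.01, rng)
    print(z)  # a noisy sample; it fluctuates around CONTENT_MEAN + 0.5 * MOTION_MEAN
```

In this toy setting the mixed score simplifies to $-(z - (c + w_{\text{MSG}}\, m))$, where $c$ and $m$ are the content and motion means, so the guidance weight shifts the mode away from the content mean in the direction of the motion mean. That is a property of the Gaussian stand-ins only and says nothing about how the weight behaves in the actual model.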