Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation

Yuval Atzmon, Rinon Gal, Yoad Tewel, Yoni Kasten, Gal Chechik

TL;DR

The paper reveals that self-attention query features in text-to-video diffusion models govern both motion and identity, causing entanglement that complicates motion transfer and multi-shot consistency. It introduces Motion by Queries for zero-shot motion transfer with superior efficiency and develops a two-phase Q intervention (Q-Preservation then Q-Flow) to achieve training-free, consistent multi-shot video generation. Through extensive experiments and ablations, the authors show how Q-injection affects identity leakage and motion fidelity, offering practical techniques to balance these factors. The work advances understanding of Q representations in video diffusion and provides actionable methods for more controllable, coherent video generation. It also discusses limitations and directions for improving identity-motion disentanglement in future work.

Abstract

Text-to-video diffusion models have shown remarkable progress in generating coherent video clips from textual descriptions. However, the interplay between motion, structure, and identity representations in these models remains under-explored. Here, we investigate how self-attention query (Q) features simultaneously govern motion, structure, and identity and examine the challenges arising when these representations interact. Our analysis reveals that Q affects not only layout, but that during denoising Q also has a strong effect on subject identity, making it hard to transfer motion without the side-effect of transferring identity. Understanding this dual role enabled us to control query feature injection (Q injection) and demonstrate two applications: (1) a zero-shot motion transfer method - implemented with VideoCrafter2 and WAN 2.1 - that is 10 times more efficient than existing approaches, and (2) a training-free technique for consistent multi-shot video generation, where characters maintain identity across multiple video shots while Q injection enhances motion fidelity.
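To make the idea of Q injection concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a single attention head, token features stored as lists of floats, and a schedule that injects source queries during the first fraction of denoising steps. The names `should_inject`, `attention_with_q_injection`, and `inject_fraction` are hypothetical, and the default of 0.4 only echoes the 40% injection level mentioned in the caption of Figure 2.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q, k, v):
    """Single-head scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def should_inject(step, num_steps, inject_fraction):
    """Assumed schedule: inject only during the first part of denoising."""
    return step < round(inject_fraction * num_steps)


def attention_with_q_injection(q_target, k_target, v_target, q_source,
                               step, num_steps, inject_fraction=0.4):
    """Use the source video's queries while injection is active.

    Keys and values always come from the target generation, so only the
    query features are borrowed from the source.
    """
    inject = should_inject(step, num_steps, inject_fraction)
    q = q_source if inject else q_target
    return self_attention(q, k_target, v_target)
```

In this sketch, a larger `inject_fraction` keeps the source queries in control for more denoising steps. According to the abstract, that is the trade-off at stake: more injection improves motion fidelity but also carries over more of the source subject's identity.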

Paper Structure

This paper contains 31 sections, 9 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: https://research.nvidia.com/labs/par/MotionByQueries/#teaser Our analysis reveals differences in Q-injection between text-to-video and text-to-image models. One key observation is that in text-to-video models, zero-shot Q injection can transfer structure and motion between different video shots. However, when a target video is prompted for the same subject, Q injection suffers from identity leakage.
  • Figure 2: Same-class motion transfer suffers from identity leakage (purple), worsening with increased Q injection. Cross-class transfer (green) achieves reasonable separation at 40% injection, where motion quality is also preserved. The leftmost purple data point shows results for random same-class images (no injection). Quantitative results use motion-transfer data from motion_inversion. For illustration, we use frames from the videos in Fig. 1.
  • Figure 3: Motion is compromised when attention is extended across video shots. Recovering the motion requires longer Q-injection periods, which in turn increase identity leakage.
  • Figure 4: https://research.nvidia.com/labs/par/MotionByQueries/#fig_qual Qualitative Results, Motion Transfer (VideoCrafter2). Each pair of rows shows frames from a real source video (top of pair) and a generated target video (bottom of pair). Transferring Motion by Queries allows source videos to be used to inject camera motion (top), non-rigid movement (middle), and combinations of movements (bottom). See the supplemental for videos and more examples.
  • Figure 5: https://research.nvidia.com/labs/par/MotionByQueries/#fig_compare Qualitative Comparisons, Motion Transfer.
  • ...and 16 more figures