
Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer

Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, Tali Dekel

TL;DR

This paper presents a zero-shot framework that uses a pre-trained text-to-video diffusion model to transfer motion between objects that differ substantially in shape and fine-grained motion, with the target described by a text prompt. The core contribution is a space-time feature analysis revealing that Spatial Marginal Mean (SMM) features capture motion and layout while remaining robust to appearance, which enables a Pairwise SMM Differences loss to guide generation. The method combines DDIM inversion, a low-frequency latent initialization, and optimization to produce edited videos that preserve the input motion while aligning with the target prompt. It outperforms baselines on both qualitative and quantitative measures, including a new Motion-Fidelity-Score and human judgments. The work shows that learned diffusion priors can be used effectively for cross-category video editing, and it highlights remaining limitations of current public T2V models.

Abstract

We present a new method for text-driven motion transfer: synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video's motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable only to limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits.

Paper Structure

This paper contains 14 sections, 8 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: Given an input video and a text prompt describing the target objects and scene, our method generates a new video in which the overall motion and scene layout of the input video are preserved, while allowing for notable structural and appearance changes.
  • Figure 2: Diffusion feature inversion via guided feature reconstruction. We extract space-time features ${\boldsymbol{f}}$ from an input video (a) and steer the generation process of a random sample to produce the same feature ${\boldsymbol{f}}$, using feature reconstruction as guidance (b); the synthesized videos closely resemble the original video content in terms of appearance, shape, and pose. Replacing the full space-time features with their spatial marginal mean feature $\texttt{SMM}[\boldsymbol{f}]$ allows for more flexibility (c); the SMM feature inversion results capture the original object pose, general position, and scene layout yet are not restricted to the original content at the pixel level. This is also demonstrated in the nearest-neighbor frames retrieved from other videos depicting similar actions, according to similarity in $\texttt{SMM}[\boldsymbol{f}]$ features (c).
  • Figure 3: Pipeline. (a) Given an input video, we apply DDIM inversion and extract space-time features ${\boldsymbol{f}}\in \mathbb{R}^{F\times M \times N \times D}$ from intermediate layer activations. We obtain our Spatial Marginal Mean (SMM) feature $\texttt{SMM}[\boldsymbol{f}] \in \mathbb{R}^{F \times D}$ by computing the mean over the spatial dimensions, and compute the pairwise differences between each pair of SMM features. (b) For editing, we guide the generation at each denoising step with our Pairwise SMM differences objective. See Sec. \ref{sec:method} for more details.
  • Figure 4: Sample results of our method. See SM for full video results.
  • Figure 5: Comparison to SA-NLA \cite{lee2023shape}. See SM for video results.
  • ...and 4 more figures
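
The feature computation described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the random arrays stand in for real diffusion activations, and the mean squared error used to compare the two sets of pairwise differences is an assumption, since the captions above do not state which distance the paper uses.

```python
import numpy as np


def spatial_marginal_mean(features: np.ndarray) -> np.ndarray:
    """Average space-time features over the two spatial axes.

    features has shape (F, M, N, D): frames, height, width, channels.
    The result has shape (F, D), one descriptor per frame.
    """
    return features.mean(axis=(1, 2))


def pairwise_differences(smm: np.ndarray) -> np.ndarray:
    """Return an (F, F, D) array with out[i, j] = smm[i] - smm[j]."""
    return smm[:, None, :] - smm[None, :, :]


def pairwise_smm_loss(source: np.ndarray, generated: np.ndarray) -> float:
    """Compare the pairwise SMM differences of two feature volumes.

    ASSUMPTION: mean squared error is used here for illustration only.
    """
    src = pairwise_differences(spatial_marginal_mean(source))
    gen = pairwise_differences(spatial_marginal_mean(generated))
    return float(np.mean((src - gen) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in activations: 4 frames, 3x5 spatial grid, 8 channels.
    source = rng.normal(size=(4, 3, 5, 8))
    generated = rng.normal(size=(4, 3, 5, 8))
    print("loss(source, generated):", pairwise_smm_loss(source, generated))
    # A constant offset applied to every frame cancels in the differences.
    print("loss(source, source + 2.5):", pairwise_smm_loss(source, source + 2.5))
```

Because the objective compares differences between frames rather than the per-frame descriptors themselves, a feature offset shared by all frames leaves it essentially unchanged, while changes in how the descriptor evolves over time are penalized. In this sketch that is a direct consequence of the subtraction in `pairwise_differences`.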