Motion Inversion for Video Customization

Luozhou Wang, Ziyang Mai, Guibao Shen, Yixun Liang, Xin Tao, Pengfei Wan, Di Zhang, Yijun Li, Yingcong Chen

TL;DR

The paper tackles motion customization in Text-to-Video generation by introducing Motion Embeddings learned from a reference video. It proposes two complementary embeddings: one that modulates temporal attention maps (motion query-key) and another that modulates attention values (motion value), with an inference-time debiasing strategy. Training backpropagates through the temporal transformer to align diffusion predictions, while inference isolates motion from appearance to enable flexible transfer across content. Extensive experiments show improvements in motion fidelity, text alignment, and temporal coherence, with ablation results confirming design choices.

Abstract

In this work, we present a novel approach for motion customization in video generation, addressing the widespread gap in the exploration of motion representation within video generative models. Recognizing the unique challenges posed by the spatiotemporal nature of video, our method introduces Motion Embeddings, a set of explicit, temporally coherent embeddings derived from a given video. These embeddings are designed to integrate seamlessly with the temporal transformer modules of video diffusion models, modulating self-attention computations across frames without compromising spatial integrity. Our approach provides a compact and efficient solution to motion representation, utilizing two types of embeddings: a Motion Query-Key Embedding to modulate the temporal attention map and a Motion Value Embedding to modulate the attention values. Additionally, we introduce an inference strategy that excludes spatial dimensions from the Motion Query-Key Embedding and applies a differential operation to the Motion Value Embedding, both designed to debias appearance and ensure the embeddings focus solely on motion. Our contributions include the introduction of a tailored motion embedding for customization tasks and a demonstration of the practical advantages and effectiveness of our method through extensive experiments.
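
The division of labor between the two embeddings can be illustrated with a small NumPy sketch of temporal self-attention. This is not the authors' implementation. It assumes the embeddings are simply added to the attention input before the query/key and value projections, which the abstract does not specify. The function name `temporal_attention`, the tensor shapes, and the random weights are all hypothetical. The Motion Query-Key Embedding here has no spatial dimensions (one vector per frame, shared across positions), while the Motion Value Embedding is per position.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(feats, w_q, w_k, w_v, m_qk=None, m_v=None):
    """Self-attention across frames, run independently per spatial position.

    feats: (N, T, C) features, N = H*W positions, T frames, C channels.
    m_qk:  (T, C) hypothetical query-key embedding, shared across positions.
    m_v:   (N, T, C) hypothetical value embedding, one per position.
    Assumption: embeddings are added to the input before projection.
    """
    qk_in = feats if m_qk is None else feats + m_qk[None, :, :]
    v_in = feats if m_v is None else feats + m_v
    q, k, v = qk_in @ w_q, qk_in @ w_k, v_in @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # (N, T, T) frame-to-frame attention map
    return attn @ v, attn

rng = np.random.default_rng(0)
N, T, C = 4, 8, 16
feats = rng.standard_normal((N, T, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
m_qk = rng.standard_normal((T, C))
m_v = rng.standard_normal((N, T, C))

out, attn = temporal_attention(feats, w_q, w_k, w_v, m_qk, m_v)
```

Under these assumptions, `m_qk` changes only which frames attend to which (the attention map), while `m_v` changes only what is aggregated; the attention map is identical with or without `m_v`.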

Paper Structure

This paper contains 16 sections, 6 equations, 7 figures, and 1 table.

Figures (7)

  • Figure 1: Applications of the proposed Motion Embeddings for customized video generation. Our method supports a wide range of motion types, including various camera movements and object motions. In each example, the first row shows the source video, while the second row shows the output. Please refer to the supplementary videos for clearer visualization.
  • Figure 2: Motion Inversion within T2V diffusion models. The top depicts the training phase, where motion embeddings $\mathcal{M}$ are learned by backpropagating the loss through the temporal transformer, influencing the spatiotemporal feature tensor $\mathbf{F}$. These embeddings are then used to modify the self-attention computations within the temporal transformer modules, ensuring enhanced inter-frame dynamics. The bottom shows the inference phase, where an input text prompt guides the generation of a coherent video sequence with the learned motion embeddings applied across the frames, producing a customized video output with desired motion attributes.
  • Figure 3: Debiasing appearance from Motion Embeddings. Left: For the Motion Query-Key Embedding, which influences the attention map, we exclude the spatial dimensions. Including them would cause the attention map between frames to capture the object's shape (e.g., the shape of the tank in the original video is visible in the attention map). Right: Following the concept of optical flow, we apply a differential operation to the Spatial-2D Motion Value Embedding, removing static appearance and preserving dynamic motion.
  • Figure 4: Sample results of our method. Our framework demonstrates exceptional adaptability in capturing a broad spectrum of movements, accurately representing everything from subtle gestures to intricate dynamic actions across various source videos. It also exhibits remarkable flexibility in responding to diverse textual prompts, enabling users to guide the synthesis process with descriptive language for customized motion outputs. Furthermore, our method seamlessly integrates with a range of T2V models such as (a) zero-scope [cerspense2023zeroscope] and (b) animate-diff [guo2023animatediff], showcasing its effectiveness in enhancing video generation with contextually rich and varied motion patterns.
  • Figure 5: Qualitative results. Compared to DMT [yatim2023space], VMC [jeong2023vmc], and Motion Director [zhao2023motiondirector], our method not only preserves the original video's motion trajectory and object poses but also generates visual features that align with text descriptions.
  • ...and 2 more figures
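
The differential operation mentioned in the Figure 3 caption can be illustrated with a short NumPy sketch. It is an illustration only: it assumes the operation is a plain frame-to-frame difference along the time axis, with the first frame set to zero so the frame count is preserved. The caption does not give the exact form, and the function name `temporal_difference` and all shapes are hypothetical.

```python
import numpy as np

def temporal_difference(m_v):
    """Frame-to-frame difference of a (T, H, W, C) value embedding.

    Assumption: out[0] = 0 and out[t] = m_v[t] - m_v[t - 1] for t >= 1,
    so the output keeps the same shape as the input.
    """
    return np.diff(m_v, axis=0, prepend=m_v[:1])

rng = np.random.default_rng(0)
T, H, W, C = 8, 4, 4, 16

# Toy embedding: a static per-pixel component (a stand-in for appearance)
# plus a component that varies from frame to frame (a stand-in for motion).
static = rng.standard_normal((H, W, C))
motion = rng.standard_normal((T, H, W, C))

debiased = temporal_difference(static[None] + motion)
```

In this toy setup, anything constant across frames cancels in the difference, so `debiased` equals the difference of the time-varying component alone. This mirrors the caption's optical-flow analogy of removing static appearance while preserving dynamic motion.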