Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

Penghui Ruan, Pichao Wang, Divya Saxena, Jiannong Cao, Yuhui Shi

TL;DR

This work proposes DEcomposed MOtion (DEMO), a framework that enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components, and by introducing text-motion and video-motion supervision to improve the model's understanding and generation of motion.

Abstract

Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture complex motions described by text. This issue stems from internal biases in text encoding, which overlooks motion, and from inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model's understanding and generation of motion. Evaluations on benchmarks such as MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from textual descriptions. Project page: https://PR-Ryan.github.io/DEMO-project/

Paper Structure

This paper contains 22 sections, 11 equations, 8 figures, and 12 tables.

Figures (8)

  • Figure 1: Our Pilot Study. We generated a set of prompts (262,144 in total) following a fixed template and grouped them according to the different parts of speech (POS). The grouped texts are then passed into the CLIP text encoder, and we calculate the sensitivity as the average sentence distance within each group. As shown on the left-hand side, CLIP is less sensitive to POS representing motion than to POS representing content. (Results are consistent across different templates and different sets of words within each POS. Further details can be found in the appendix.)
  • Figure 2: Overview of DEMO Training. As shown on the left-hand side, DEMO incorporates dual text encoding and text conditioning (for simplicity, other layers in the UNet are omitted). As shown on the right-hand side, during training, $\mathcal{L}_{\text{text-motion}}$ is used to enhance motion encoding, $\mathcal{L}_{\text{reg}}$ is used to avoid catastrophic forgetting, and $\mathcal{L}_{\text{video-motion}}$ is used to enhance motion integration. The snowflakes and flames denote frozen and trainable parameters, respectively.
  • Figure 3: Qualitative Comparison. Each video is generated with 16 frames. We display frames 1, 2, 4, 6, 8, 10, 12, 14, 15, and 16, arranged in two rows from left to right. Full videos are available in the supplementary materials.
  • Figure 4: Limitations. DEMO does not support creating videos containing sequential motions specified by text. As shown in the example, the two motions, "a man standing in a kitchen and talking" and "a mixer and a carton of milk are shown", appear simultaneously instead of in sequence.
  • Figure 5: Extended qualitative comparison. Each video is generated with 16 frames. We display frames 1, 2, 4, 6, 8, 10, 12, 14, 15, and 16, arranged in two rows from left to right. Full videos are available in the supplementary materials.
  • ...and 3 more figures
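
The sensitivity measure described in the Figure 1 caption (average sentence distance within each POS group) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the template, the word lists, the use of cosine distance, and the `toy_encode` function, which is a deterministic bag-of-words stand-in for the CLIP text encoder. With a toy encoder the numbers say nothing about CLIP; the sketch only shows how prompts could be grouped and scored.

```python
import itertools
import math
import random
import zlib

# Hypothetical template and vocabulary; the paper's actual choices are in its appendix.
TEMPLATE = "a {adj} {noun} is {verb} {adv}"
VOCAB = {
    "adj": ["red", "small", "old"],
    "noun": ["dog", "car", "child"],
    "verb": ["running", "jumping", "spinning"],
    "adv": ["slowly", "quickly", "quietly"],
}

def toy_encode(sentence, dim=32):
    """Stand-in for a text encoder: sum of fixed pseudo-random word vectors."""
    vec = [0.0] * dim
    for word in sentence.lower().split():
        rng = random.Random(zlib.crc32(word.encode("utf-8")))
        for i in range(dim):
            vec[i] += rng.gauss(0.0, 1.0)
    return vec

def cosine_distance(u, v):
    """1 - cosine similarity (the distance metric is an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_pairwise_distance(embeddings):
    """Average distance over all unordered pairs of embeddings."""
    pairs = list(itertools.combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

def pos_sensitivity(template, vocab, slot, encode):
    """Vary the words in `slot` while holding every other slot fixed,
    then average the within-group distance over all fixed contexts."""
    other_slots = [s for s in vocab if s != slot]
    scores = []
    for combo in itertools.product(*(vocab[s] for s in other_slots)):
        fixed = dict(zip(other_slots, combo))
        group = [template.format(**fixed, **{slot: w}) for w in vocab[slot]]
        scores.append(mean_pairwise_distance([encode(s) for s in group]))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    for slot in VOCAB:
        print(slot, round(pos_sensitivity(TEMPLATE, VOCAB, slot, toy_encode), 4))
```

To reproduce the spirit of the pilot study, `toy_encode` would be replaced by a call that returns the sentence embedding from a real CLIP text encoder; a lower score for a slot means the encoder distinguishes words in that slot less strongly.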