NewMove: Customizing text-to-video models with novel motions

Joanna Materzynska; Josef Sivic; Eli Shechtman; Antonio Torralba; Richard Zhang; Bryan Russell

TL;DR

We introduce NewMove, a method that customizes text-to-video diffusion models to learn novel motions from only a few example videos. It assigns a dedicated motion token $V^*$ and fine-tunes a small subset of parameters (the temporal layers and the spatial cross-attention keys/values), regularizes with real video data to prevent forgetting, and uses nonuniform timestep sampling to emphasize motion over appearance. The learned motion generalizes to multiple subjects, backgrounds, and even non-human agents, and can be combined with other movements. Quantitative metrics on gesture recognition and CLIP-based alignment, plus a user study, show meaningful improvements over prior appearance-based customization and motion transfer baselines. This enables flexible, controllable motion customization in text-to-video generation, with practical implications for creative video synthesis.
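
The nonuniform timestep sampling mentioned in the summary can be illustrated with a minimal sketch. The summary only states that the sampling is nonuniform; the cosine-shaped weighting, the `alpha` parameter, the function names, and the choice to favor larger (noisier) timestep indices are all assumptions made for illustration, not the authors' stated formula.

```python
import math
import random


def timestep_probs(num_steps: int, alpha: float = 0.5) -> list[float]:
    """Return a normalized distribution over timesteps 0..num_steps-1.

    Assumed (hypothetical) weighting: w(t) = 1 - alpha * cos(pi * t / (T - 1)),
    which grows with t, so larger timestep indices are drawn more often.
    alpha = 0 recovers uniform sampling.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    weights = [
        1.0 - alpha * math.cos(math.pi * t / (num_steps - 1))
        for t in range(num_steps)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def sample_timesteps(num_steps, batch_size, alpha=0.5, rng=None):
    """Draw one diffusion timestep per training example from the skewed distribution."""
    rng = rng or random.Random()
    probs = timestep_probs(num_steps, alpha)
    return rng.choices(range(num_steps), weights=probs, k=batch_size)
```

In a fine-tuning loop, a call such as `sample_timesteps(1000, batch_size)` would stand in for the usual uniform draw of timesteps; everything else in the training step stays unchanged.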

Abstract

We introduce an approach for augmenting text-to-video generation models with customized motions, extending their capabilities beyond the motions depicted in the original training data. By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios. Our contributions are threefold. First, to achieve our results, we finetune an existing text-to-video model to learn a novel mapping between the depicted motion in the input examples and a new unique token. To avoid overfitting to the new custom motion, we introduce an approach for regularization over videos. Second, by leveraging the motion priors in a pretrained model, our method can produce novel videos featuring multiple people doing the custom motion, and can invoke the motion in combination with other motions. Furthermore, our approach extends to the multimodal customization of motion and appearance of individualized subjects, enabling the generation of videos featuring unique characters and distinct motions. Third, to validate our method, we introduce an approach for quantitatively evaluating the learned custom motion and perform a systematic ablation study. We show that our method significantly outperforms prior appearance-based customization approaches when extended to the motion customization task.

Paper Structure (9 sections, 3 equations, 6 figures, 3 tables)

Introduction
Related Work
Approach
Text-to-Video Diffusion Model Preliminaries
Approach for Motion Customization
Experiments
Qualitative Comparison and a User Study
Automated quantitative evaluation via recognition
Conclusion

Figures (6)

Figure 1: (Left) Given a few examples ("Carlton dance"), our customization method learns the dynamic motion pattern common to the input examples and incorporates it into a pre-trained text-to-video diffusion model using a new motion identifier ("V* dance"). (Right) Our approach, NewMove, abstracts the motion pattern from the appearance in the input videos and enables generation of the depicted motion across a variety of novel contexts, including with a non-humanoid subject (robot, top row), multiple motions (lady, middle row), and multiple subjects (group of nurses, bottom row). To best view the results, please visit https://joaanna.github.io/customizing_motion/.
Figure 2: Overview. Given a small set of exemplar videos, our approach fine-tunes the U-Net of a text-to-video model using a reconstruction objective. The motion is identified with a unique motion identifier and can be used at test time to synthesize novel subjects performing the motion. To represent the added motion but preserve information from the pretrained model, we tune a subset of weights -- the temporal convolution and attention layers, in addition to the key & value layers in the spatial attention layer. A set of related videos is used to regularize the tuning process.
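
The weight selection described in the Figure 2 caption can be sketched as a filter over parameter names. This is a minimal illustration, not the authors' code: the dotted naming scheme (`temporal_conv`, `temporal_attn`, `spatial_attn`, `to_k`, `to_v`) and both function names are hypothetical, and a real model would need the patterns adapted to its own parameter names.

```python
def is_trainable(name: str) -> bool:
    """Decide whether a parameter is tuned, given its dotted name.

    Hypothetical naming convention (an assumption for this sketch):
      - temporal convolution / attention modules start with "temporal_"
      - spatial attention modules are called "spatial_attn"
      - key and value projections are called "to_k" and "to_v"
    """
    parts = name.split(".")
    # All temporal layers (convolution and attention) are tuned.
    if any(part.startswith("temporal_") for part in parts):
        return True
    # In spatial attention, only the key and value projections are tuned.
    return "spatial_attn" in parts and any(p in ("to_k", "to_v") for p in parts)


def split_parameters(names):
    """Partition parameter names into (trainable, frozen) lists."""
    trainable = [n for n in names if is_trainable(n)]
    frozen = [n for n in names if not is_trainable(n)]
    return trainable, frozen


example_names = [
    "down.0.spatial_conv.weight",
    "down.0.spatial_attn.to_q.weight",
    "down.0.spatial_attn.to_k.weight",
    "down.0.spatial_attn.to_v.weight",
    "down.0.temporal_conv.weight",
    "down.0.temporal_attn.to_q.weight",
]
trainable, frozen = split_parameters(example_names)
```

With the example names above, the spatial convolution and the spatial query projection stay frozen, while the remaining four parameters are tuned. In a deep learning framework, the same predicate would typically be used to switch gradients on or off per parameter before building the optimizer.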
Figure 3: Visual comparison with baseline methods. Examples of learning the customized motion Sliding Two Fingers Up from the Jester dataset with the prompt "A female firefighter doing the V* sign". Baseline methods (top three rows) fail to capture the motion or to produce a temporally coherent video.
Figure 4: Qualitative results of our method. We demonstrate two custom motions, Dab and Air quotes, trained using collected internet examples, as well as a 3D camera rotation trained with examples from the CO3D dataset (Reizenstein et al., 2021). Our method can generalize to unseen subjects and multiple people performing the action.
Figure 5: Text-driven motion transfer methods versus our method trained on a few examples of a custom motion "Shaking Hand". Our method seamlessly renders a custom motion in novel scenarios. Although the training videos show only a single actor performing one motion, our method generates the custom motion alongside another action ("doing the gesture while eating a burger"), varies the timing ("doing the gesture slowly and precisely"), or involves multiple people ("children"). In contrast, both baselines fail to generalize or to produce temporally coherent videos.
...and 1 more figure
