NewMove: Customizing text-to-video models with novel motions
Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, Bryan Russell
TL;DR
We introduce NewMove, a method for customizing text-to-video diffusion models to learn novel motions from only a few example videos. The method assigns a dedicated motion token $V^*$ and fine-tunes a small subset of parameters (the temporal layers and the keys and values of the spatial cross-attention layers). It regularizes with real video data to prevent forgetting and samples timesteps nonuniformly to emphasize motion over appearance. The learned motion generalizes to multiple subjects, backgrounds, and even non-human agents, and can be combined with other movements. Quantitative metrics based on gesture recognition and CLIP alignment, together with a user study, show improvements over prior appearance-based customization and motion transfer baselines. The result is flexible, controllable motion customization for text-to-video generation, with practical uses in creative video synthesis.
Abstract
We introduce an approach for augmenting text-to-video generation models with customized motions, extending their capabilities beyond the motions depicted in the original training data. By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios. Our contributions are threefold. First, to achieve our results, we finetune an existing text-to-video model to learn a novel mapping from the depicted motion in the input examples to a new unique token. To avoid overfitting to the new custom motion, we introduce an approach for regularization over videos. Second, by leveraging the motion priors in a pretrained model, our method can produce novel videos featuring multiple people doing the custom motion, and can invoke the motion in combination with other motions. Furthermore, our approach extends to the multimodal customization of motion and appearance of individualized subjects, enabling the generation of videos featuring unique characters and distinct motions. Third, to validate our method, we introduce an approach for quantitatively evaluating the learned custom motion and perform a systematic ablation study. We show that our method significantly outperforms prior appearance-based customization approaches when extended to the motion customization task.
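As a rough illustration of two ideas from the summary above (fine-tuning only a subset of parameters, and sampling diffusion timesteps nonuniformly), the following Python sketch may help. It is not the authors' implementation. The name markers `temporal`, `attn2.to_k`, and `attn2.to_v` are hypothetical placeholders for however a given model names its temporal layers and cross-attention key/value projections. The power-law weighting with exponent `alpha`, and the choice to favor noisier (larger) timesteps, are assumptions made for illustration; this section does not specify the actual sampling distribution.

```python
import random

# Hypothetical substrings used to pick trainable parameters by name.
# Real models name their layers differently; adjust for the model at hand.
TEMPORAL_MARKERS = ("temporal",)
CROSS_ATTN_KV_MARKERS = ("attn2.to_k", "attn2.to_v")


def is_trainable(param_name):
    """Return True if the parameter name matches the fine-tuned subset."""
    markers = TEMPORAL_MARKERS + CROSS_ATTN_KV_MARKERS
    return any(marker in param_name for marker in markers)


def sample_timesteps(num_steps, batch_size, alpha=2.0, rng=None):
    """Draw timesteps from {0, ..., num_steps - 1} with p(t) proportional
    to (t + 1) ** alpha.

    alpha = 0 recovers uniform sampling; larger alpha puts more mass on
    larger timesteps. The power-law form is an illustrative assumption.
    """
    rng = rng or random.Random()
    weights = [(t + 1) ** alpha for t in range(num_steps)]
    return rng.choices(range(num_steps), weights=weights, k=batch_size)
```

With a PyTorch model, one plausible use would be to loop over `model.named_parameters()` and set `requires_grad` on each parameter to `is_trainable(name)`, then call `sample_timesteps` once per training batch in place of uniform timestep sampling.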
