Mojito: Motion Trajectory and Intensity Control for Video Generation

Xuehai He, Shuohang Wang, Jianwei Yang, Xiaoxia Wu, Yiping Wang, Kuan Wang, Zheng Zhan, Olatunji Ruwase, Yelong Shen, Xin Eric Wang

TL;DR

Mojito is a diffusion model that incorporates both motion trajectory and intensity control for text-to-video generation, producing realistic dynamics that align well with natural motion in real-world scenarios.

Abstract

Recent advancements in diffusion models have shown great promise in producing high-quality video content. However, efficiently training video diffusion models capable of integrating directional guidance and controllable motion intensity remains a challenging and under-explored area. To tackle these challenges, this paper introduces Mojito, a diffusion model that incorporates both motion trajectory and intensity control for text-to-video generation. Specifically, Mojito features a Directional Motion Control (DMC) module that leverages cross-attention to efficiently direct the generated object's motion without training, alongside a Motion Intensity Modulator (MIM) that uses optical flow maps generated from videos to guide varying levels of motion intensity. Extensive experiments demonstrate Mojito's effectiveness in achieving precise trajectory and intensity control with high computational efficiency. The generated motion patterns closely match the specified directions and intensities and exhibit realistic dynamics that align well with natural motion in real-world scenarios.
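
The abstract says the Motion Intensity Modulator is guided by optical flow maps computed from videos. As a rough illustration only, the sketch below reduces a stack of precomputed flow fields to a single scalar by averaging the per-pixel flow magnitude. The function name `motion_intensity`, the array layout, and the choice of the mean as the summary statistic are assumptions made for this example; the summary above does not specify how Mojito derives its intensity signal.

```python
import numpy as np

def motion_intensity(flow):
    """Mean optical-flow magnitude over all frames and pixels.

    flow: array of shape (T, H, W, 2) holding (dx, dy) per pixel for
    each pair of consecutive frames. This layout is an assumption.
    """
    flow = np.asarray(flow, dtype=np.float64)
    # Euclidean length of each (dx, dy) displacement vector.
    magnitude = np.linalg.norm(flow, axis=-1)
    return float(magnitude.mean())

# Toy inputs: a static clip and a clip where every pixel moves by (3, 4).
static = np.zeros((3, 8, 8, 2))
moving = np.tile(np.array([3.0, 4.0]), (3, 8, 8, 1))

print(motion_intensity(static))
print(motion_intensity(moving))
```

A scalar like this could label training clips with an intensity level, which a conditioning encoder would then consume; that pairing is likewise a hypothetical reading of the abstract.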

Paper Structure

This paper contains 38 sections, 7 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Mojito generates videos that accurately follow specified directions, locations, and trajectories, while adapting to varying input motion intensities. (a) Directional Motion Control: the object (leaf, firefly) in the generated videos can follow input bounding boxes or trajectories over time. (b) Motion Intensity Modulator: increasing input motion intensity levels results in a corresponding increase in motion, transforming a relatively static scene into one with more dynamic movement. Additional examples can be found at https://sites.google.com/view/mojito-video.
  • Figure 2: Overview of the Mojito framework. In the training pipeline (top), Mojito uses a VAE Encoder to transform input frames into latent features, processed by Spatial and Temporal Transformers within the U-Net. Motion intensity control is introduced through the Motion Intensity Modulator, consisting of the Optical Flow Map Generator and the Motion Intensity Encoder. The Directional Motion Control module interprets object phrases within the prompt to align attention with specified trajectories. During inference (bottom), Mojito generates videos following user-defined motion intensity and directional guidance.
  • Figure 3: Overview of the Directional Motion Control module. The cross-attention map for the chosen word token in the given guidance step is marked with a red border. We compute the loss and perform backpropagation during inference time to update latents.
  • Figure 4: Qualitative comparison of directional control with Tora. Mojito achieves motion control comparable to Tora while offering additional capabilities to specify objects and precise locations without training. The red bounding boxes, serving as inputs to Mojito, guide the balloon to follow the specified trajectory.
  • Figure 5: (a) Ablation Study on Temporal Smoothness Loss: Without temporal smoothness loss, the generated sailboat exhibits inconsistencies across frames. (b) Ablation Study on Guidance Strength: Adjusting the guidance strength demonstrates a trade-off between video quality and trajectory alignment.
  • ...and 7 more figures
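
The Figure 3 caption describes computing a loss on a cross-attention map and backpropagating at inference time to update the latents. The toy sketch below illustrates that general idea of training-free guidance, not Mojito's actual loss: a softmax over a small grid of logits stands in for the attention map of one word token, and gradient steps on the logits push attention mass into a target bounding box. The loss form, the helper names (`box_mask`, `guide_step`), the grid size, and the step size are all assumptions for illustration, and the gradient is derived by hand so the example needs only NumPy.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def box_mask(h, w, box):
    """Binary mask that is 1 inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    mask = np.zeros((h, w))
    mask[top:bottom, left:right] = 1.0
    return mask

def guide_step(logits, mask, lr):
    """One gradient step on loss = (1 - attention mass inside the box)^2."""
    attn = softmax(logits)
    inside = (attn * mask).sum()
    loss = (1.0 - inside) ** 2
    # d(inside)/d(logit_j) = attn_j * (mask_j - inside) for a softmax.
    grad = -2.0 * (1.0 - inside) * attn * (mask - inside)
    return logits - lr * grad, loss, inside

rng = np.random.default_rng(0)
H = W = 16
logits = 0.1 * rng.standard_normal((H, W))  # stand-in for the latents
mask = box_mask(H, W, (2, 2, 6, 6))

mass_before = (softmax(logits) * mask).sum()
for _ in range(200):
    logits, loss, _ = guide_step(logits, mask, lr=10.0)
mass_after = (softmax(logits) * mask).sum()

print(f"attention mass in box: {mass_before:.3f} -> {mass_after:.3f}")
```

In a real diffusion pipeline the gradient would flow through the U-Net's cross-attention layers to the latents via automatic differentiation, and the box would move from frame to frame to trace the trajectory.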