MotionBooth: Motion-Aware Customized Text-to-Video Generation

Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, Kai Chen

TL;DR

MotionBooth introduces a motion-aware framework for customized text-to-video generation that learns a target subject from a few images using subject region loss, video preservation loss, and a subject token cross-attention loss. It enables training-free control of subject and camera motion at inference via cross-attention map editing and a latent shift module, respectively. Across two base diffusion models, MotionBooth achieves superior subject fidelity, motion alignment, and video quality compared with state-of-the-art baselines, while maintaining generalizability. Limitations include handling multiple objects and complex motions, pointing to future work on multi-subject scenarios and broader integration with more powerful T2V models.

Abstract

In this work, we present MotionBooth, an innovative framework designed for animating customized subjects with precise control over both object and camera movements. By leveraging a few images of a specific object, we efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately. Our approach introduces a subject region loss and a video preservation loss to improve subject learning, along with a subject token cross-attention loss to integrate the customized subject with motion control signals. Additionally, we propose training-free techniques for managing subject and camera motions during inference. In particular, we use cross-attention map manipulation to govern subject motion and introduce a novel latent shift module for camera movement control. MotionBooth excels in preserving the appearance of subjects while simultaneously controlling the motions in generated videos. Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. Our project page is at https://jianzongwu.github.io/projects/motionbooth
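
To make the idea of cross-attention map manipulation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the pre-softmax cross-attention scores for one frame are available as an array of shape (H*W, n_tokens), and that the desired subject position is given as a bounding box in latent coordinates. The function name `edit_subject_attention`, the `strength` parameter, and the additive boost/suppress rule are all hypothetical choices for illustration; the exact editing rule used by MotionBooth is not stated in this summary.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def edit_subject_attention(scores, token_idx, box, height, width, strength=5.0):
    """Bias one token's cross-attention toward a bounding box (hypothetical rule).

    scores:    (height * width, n_tokens) pre-softmax attention scores, one frame
    token_idx: column index of the subject token in the prompt
    box:       (x0, y0, x1, y1) in latent pixels, half-open on the right/bottom
    strength:  assumed additive bias; not a value taken from the paper
    """
    x0, y0, x1, y1 = box
    inside = np.zeros((height, width), dtype=bool)
    inside[y0:y1, x0:x1] = True
    inside = inside.reshape(-1)

    edited = scores.copy()
    edited[inside, token_idx] += strength   # amplify the subject inside the box
    edited[~inside, token_idx] -= strength  # suppress the subject elsewhere
    return edited


# Usage: a 4x4 latent grid, 3 prompt tokens, subject token at index 1.
scores = np.zeros((16, 3), dtype=np.float32)
probs = softmax(edit_subject_attention(scores, 1, (0, 0, 2, 2), 4, 4))
```

Applying a different box to each frame would move the region where the subject token receives attention, which is the general mechanism the abstract refers to as governing subject motion. No retraining is involved, since only the attention scores are edited at inference time.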

Paper Structure

This paper contains 21 sections, 8 equations, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: Motion-aware customized video generation results of MotionBooth. Our method animates a customized object with controllable subject and camera motions.
  • Figure 2: The overall pipeline of MotionBooth. We first fine-tune a T2V model on the subject. This procedure incorporates subject region loss, video preservation loss, and subject token cross-attention loss. During inference, we control the camera movement with a novel latent shift module. At the same time, we manipulate the cross-attention maps to govern the subject motion.
  • Figure 3: Case study on subject learning. "Region" indicates subject region loss. "Video" indicates video preservation loss. The images are extracted from generated videos.
  • Figure 4: Case study on subject token cross-attention maps. (b) and (c) are visualizations of the cross-attention maps for the tokens "[V]" and "dog".
  • Figure 5: Illustration of camera movement control through shifting the noised latent.
  • ...and 10 more figures
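
The camera control idea in Figure 5, shifting the noised latent, can be sketched as follows. This is a hedged illustration under stated assumptions rather than the paper's latent shift module: the function names `shift_latent` and `shift_video_latent` are hypothetical, the shift is an integer number of latent pixels that grows linearly with the frame index, and the region vacated by the shift is filled with fresh Gaussian noise. The fill strategy in particular is an assumption, since this summary does not say how MotionBooth fills the vacated region.

```python
import numpy as np


def shift_latent(latent, dx, dy, rng):
    """Translate one frame's latent content by (dx, dy) latent pixels.

    latent: (C, H, W) noised latent of a single frame
    dx, dy: integer displacement of the content (positive = right / down)
    rng:    numpy Generator used to fill the vacated region (assumed strategy)
    """
    _, h, w = latent.shape
    out = rng.standard_normal(latent.shape).astype(latent.dtype)

    # Source window in the original latent and where it lands after the shift.
    src_x0, src_x1 = max(0, -dx), min(w, w - dx)
    src_y0, src_y1 = max(0, -dy), min(h, h - dy)
    dst_x0, dst_y0 = max(0, dx), max(0, dy)

    if src_x0 < src_x1 and src_y0 < src_y1:
        out[:, dst_y0:dst_y0 + (src_y1 - src_y0),
               dst_x0:dst_x0 + (src_x1 - src_x0)] = \
            latent[:, src_y0:src_y1, src_x0:src_x1]
    return out


def shift_video_latent(latents, velocity, rng):
    """Apply a shift that grows linearly over frames (assumed motion model).

    latents:  (T, C, H, W) noised video latent
    velocity: (vx, vy) content displacement per frame, in latent pixels
    """
    vx, vy = velocity
    frames = [
        shift_latent(frame, int(round(t * vx)), int(round(t * vy)), rng)
        for t, frame in enumerate(latents)
    ]
    return np.stack(frames)


# Usage: 4 frames, 4 channels, 8x8 latent; content drifts 1 latent pixel
# to the left per frame.
rng = np.random.default_rng(0)
video = rng.standard_normal((4, 4, 8, 8)).astype(np.float32)
shifted = shift_video_latent(video, (-1.0, 0.0), rng)
```

Because the operation only rearranges the latent between denoising steps, it needs no additional training, which matches the training-free framing of the camera control in the abstract.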