DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, Yingya Zhang, Hongming Shan

TL;DR

DreamVideo-2 tackles zero-shot subject-driven video customization with precise motion control. It introduces reference attention for subject learning and a mask-guided motion module driven by box masks derived from bounding boxes. Because motion control tends to dominate subject learning, it adds blended latent masks to reference attention and uses a reweighted diffusion loss that weights regions inside and outside the bounding boxes differently, with training performed on a newly curated large single-subject dataset. The approach requires no fine-tuning at inference and is reported to outperform state-of-the-art baselines in both subject fidelity and motion accuracy. The authors state that the dataset, code, and models will be made publicly available.

Abstract

Recent advances in customized video generation have enabled users to create videos tailored to both specific subjects and motion trajectories. However, existing methods often require complicated test-time fine-tuning and struggle with balancing subject learning and motion control, limiting their real-world applications. In this paper, we present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory, guided by a single image and a bounding box sequence, respectively, and without the need for test-time fine-tuning. Specifically, we introduce reference attention, which leverages the model's inherent capabilities for subject learning, and devise a mask-guided motion module to achieve precise motion control by fully utilizing the robust motion signal of box masks derived from bounding boxes. While these two components achieve their intended functions, we empirically observe that motion control tends to dominate over subject learning. To address this, we propose two key designs: 1) the masked reference attention, which integrates a blended latent mask modeling scheme into reference attention to enhance subject representations at the desired positions, and 2) a reweighted diffusion loss, which differentiates the contributions of regions inside and outside the bounding boxes to ensure a balance between subject and motion control. Extensive experimental results on a newly curated dataset demonstrate that DreamVideo-2 outperforms state-of-the-art methods in both subject customization and motion control. The dataset, code, and models will be made publicly available.
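The reweighted diffusion loss described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name, the tensor layout, and the weights `w_in` and `w_out` are assumptions chosen for illustration. The only idea taken from the abstract is that the noise-prediction error is weighted differently inside and outside the bounding boxes.

```python
import numpy as np

def reweighted_diffusion_loss(noise_pred, noise, box_mask, w_in=2.0, w_out=1.0):
    """Mean squared noise-prediction error with per-region weights.

    noise_pred, noise: arrays of shape (frames, channels, height, width).
    box_mask: array of shape (frames, 1, height, width), 1 inside the box, 0 outside.
    w_in, w_out: illustrative weights (assumed values, not taken from the paper).
    """
    # Per-pixel weight: w_in inside the box, w_out elsewhere.
    weights = w_in * box_mask + w_out * (1.0 - box_mask)
    sq_err = (noise_pred - noise) ** 2
    # The mask broadcasts over the channel axis.
    return float(np.mean(weights * sq_err))

# Toy example: unit error everywhere, box covering the top half of each frame.
noise = np.zeros((2, 4, 4, 4))
noise_pred = np.ones((2, 4, 4, 4))
box_mask = np.zeros((2, 1, 4, 4))
box_mask[:, :, :2, :] = 1.0
loss = reweighted_diffusion_loss(noise_pred, noise, box_mask)
plain = reweighted_diffusion_loss(noise_pred, noise, box_mask, w_in=1.0, w_out=1.0)
```

With equal weights the function reduces to the ordinary mean squared error, so the ratio of `w_in` to `w_out` is the single knob that shifts emphasis toward the subject region.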

Paper Structure

This paper contains 21 sections, 6 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Customized video generation results of DreamVideo-2. Our method precisely generates customized subjects at specified positions without fine-tuning at inference time.
  • Figure 2: Overall framework of DreamVideo-2. During training, a random video frame is segmented to obtain the subject image with a blank background. The bounding boxes extracted from the training video are converted into binary box masks. Then, the subject image is treated as a single-frame video and processed in parallel with the video by masked reference attention that incorporates blended masks to learn the subject appearance. Meanwhile, box masks are fed into a motion module that includes a spatiotemporal encoder and a ControlNet for motion control. Both the masked reference attention and motion module are trained using a reweighted diffusion loss.
  • Figure 3: Illustration of motion control domination in DreamVideo-2. As seen in (b) and (c), motion control tends to dominate over subject learning during training, causing the degradation of subject identity. In (d), our method ensures a balance between subject and motion control.
  • Figure 4: Qualitative comparison of joint subject customization and motion control. DreamVideo-2 generates videos with customized subjects and precise motion trajectory control, while other methods suffer from control conflicts, especially when trained on a single image.
  • Figure 5: Qualitative comparison of subject customization. DreamVideo-2 generates videos with accurate subject appearance and enhanced motion dynamics, aligning with provided prompts.
  • ...and 6 more figures
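The Figure 2 caption mentions that bounding boxes extracted from the training video are converted into binary box masks. A minimal sketch of that conversion follows. The box format `(x0, y0, x1, y1)` in pixel coordinates with exclusive upper bounds, and the helper name `boxes_to_masks`, are assumptions for illustration and may differ from the authors' code.

```python
import numpy as np

def boxes_to_masks(boxes, height, width):
    """Convert one bounding box per frame into a stack of binary masks.

    boxes: sequence of (x0, y0, x1, y1) tuples, one per frame (assumed format).
    Returns an array of shape (frames, height, width) with 1.0 inside each box.
    """
    masks = np.zeros((len(boxes), height, width), dtype=np.float32)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        # Clip the box to the frame so out-of-range coordinates are safe.
        x0, x1 = max(0, int(x0)), min(width, int(x1))
        y0, y1 = max(0, int(y0)), min(height, int(y1))
        if x1 > x0 and y1 > y0:
            masks[t, y0:y1, x0:x1] = 1.0
    return masks

# Toy trajectory: a box that moves and grows across two 4x4 frames.
masks = boxes_to_masks([(0, 0, 2, 2), (1, 1, 3, 4)], height=4, width=4)
```

Each frame's mask marks where the subject should appear, which is the kind of signal the caption describes being fed to the motion module.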