CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guangcong Zheng, Zhongang Qi, Ying Shan, Xi Li

TL;DR

CustomCrafter tackles the challenge of personalized video generation by learning a subject's appearance without extra video data or model re-tuning. It introduces a LoRA-based Spatial Subject Learning Module that updates both cross- and self-attention to preserve concept combination, and a Dynamic Weighted Video Sampling Strategy to reduce early-stage disturbance to motion during denoising. By decoupling appearance learning from motion in the denoising process and employing prior-preservation during training, the method preserves the diffusion model's inherent motion generation while achieving high subject fidelity. Experimental results across 20 subjects show improvements in subject fidelity and text alignment over prior methods, with favorable user-study feedback and strong ablation support for the proposed components.

Abstract

Customized video generation aims to generate high-quality videos guided by text prompts and reference images of a subject. However, because subject learning fine-tunes the model only on static images, it disrupts the ability of video diffusion models (VDMs) to combine concepts and generate motion. To restore these abilities, some methods use additional videos similar to the prompt to fine-tune or guide the model. This requires frequently changing the guiding videos, and even re-tuning the model, when generating different motions, which is very inconvenient for users. In this paper, we propose CustomCrafter, a novel framework that preserves the model's motion generation and concept combination abilities without additional videos or fine-tuning for recovery. To preserve the concept combination ability, we design a plug-and-play module that updates a small number of parameters in VDMs, enhancing the model's ability to capture appearance details and to combine concepts for new subjects. For motion generation, we observe that VDMs tend to restore the motion of the video in the early stage of denoising, while focusing on recovering subject details in the later stage. We therefore propose the Dynamic Weighted Video Sampling Strategy. Exploiting the pluggability of our subject learning module, we reduce its impact on motion generation in the early stage of denoising, preserving the VDM's ability to generate motion. In the later stage of denoising, we restore this module to repair the appearance details of the specified subject, thereby ensuring the fidelity of the subject's appearance. Experimental results show that our method significantly improves over previous methods. Code is available at https://github.com/WuTao-CS/CustomCrafter
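
The two-stage idea in the abstract (a reduced module influence early in denoising, restored later) can be illustrated as a per-step weight schedule. The following is only a minimal sketch under stated assumptions: the function name `lora_weight_schedule`, the hard switch at `motion_steps`, and the weights 0.2 and 1.0 are hypothetical placeholders, since this summary does not give the schedule or values the authors use.

```python
def lora_weight_schedule(num_steps, motion_steps, early_weight=0.2, late_weight=1.0):
    """Return one subject-module weight per denoising step.

    Hypothetical illustration: the first `motion_steps` steps use a reduced
    weight (motion forms here), the remaining steps use the full weight
    (subject appearance is repaired here). All values are assumptions.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0 <= motion_steps <= num_steps:
        raise ValueError("motion_steps must lie in [0, num_steps]")
    return [
        early_weight if step < motion_steps else late_weight
        for step in range(num_steps)
    ]


if __name__ == "__main__":
    # Five denoising steps, the first two treated as the motion stage.
    for step, weight in enumerate(lora_weight_schedule(5, 2)):
        print(f"step {step}: subject-module weight {weight}")
```

A sampler would read the weight for the current step and apply it to the subject learning module before each denoising call; how that weight enters the network is outside the scope of this sketch.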

Paper Structure (20 sections, 4 equations, 7 figures, 2 tables, 1 algorithm)

This paper contains 20 sections, 4 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of our approach with previous work. Our method better learns the appearance of the subject while preserving the concept combination and motion generation abilities, and requires only one training stage without additional videos. DWV Sampling Strategy denotes our Dynamic Weighted Video Sampling Strategy.
  • Figure 2: Visualization for our CustomCrafter. Our approach allows customization of subject identity and movement patterns to generate the desired video with text prompt by preserving motion generation and conceptual combination abilities.
  • Figure 3: Visualization of video denoising process. The motion is formed in early stages of the denoising process, and the subject's appearance emerges in later stages.
  • Figure 4: Overview of CustomCrafter. For subject learning, we adopt LoRA to construct the Spatial Subject Learning Module, which updates the Query, Key, and Value parameters of the attention layers in all Spatial Transformer models. When generating videos, we divide the denoising process into two phases: the motion layout repair process and the subject appearance repair process. We reduce the influence of the Spatial Subject Learning Module during the motion layout repair process and restore it during the subject appearance repair process to repair the details of the subject.
  • Figure 5: Qualitative comparison of customized video generation with both subjects and motions. Without guidance from additional videos, our method significantly outperforms the compared methods in terms of concept combination.
  • ...and 2 more figures
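
The Figure 4 caption describes a LoRA module on the Query, Key, and Value projections whose influence is reduced and later restored. A generic way to picture this is the standard LoRA form W' = W + s * (B A), where s scales the low-rank update. The sketch below uses plain Python lists and toy 2x2 values; the helper names and the numbers are hypothetical, and it is not presented as the authors' implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(base, lora_b, lora_a, scale):
    """Return base + scale * (lora_b @ lora_a), the generic scaled LoRA update.

    scale = 0 recovers the frozen base projection; scale = 1 applies the
    full learned update. Intermediate values weaken the update.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


if __name__ == "__main__":
    base = [[1, 0], [0, 1]]   # toy frozen projection (e.g. a Query weight)
    lora_b = [[1], [2]]       # toy rank-1 factors, assumed values
    lora_a = [[3, 4]]
    for scale in (0.0, 0.5, 1.0):
        print(scale, lora_effective_weight(base, lora_b, lora_a, scale))
```

With a small scale the projection stays close to the frozen base weights, which matches the caption's description of limiting the module's influence during the motion layout repair process.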