VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models

Hong Chen, Xin Wang, Guanning Zeng, Yipeng Zhang, Yuwei Zhou, Feilin Han, Yaofei Wu, Wenwu Zhu

TL;DR

This work tackles customized multi-subject text-to-video generation, where multiple user-defined subjects must be preserved while the video follows a textual prompt. It introduces VideoDreamer, which builds on Stable Diffusion extended with temporal modules and employs Disen-Mix Finetuning to address attribute binding across subjects, complemented by optional Human-in-the-Loop Re-finetuning and a disentangled motion customization pipeline. The authors present MultiStudioBench, a dedicated benchmark with 25 subjects and multiple subject combinations for evaluating subject fidelity, prompt fidelity, temporal consistency, and artifacts, and report that VideoDreamer outperforms the baselines and produces fewer stitch artifacts. Ablation studies show the value of disentangled embeddings, the weak denoising objective, and the combination of separate and mixed data finetuning, while also revealing limitations for higher-order multi-subject scenarios and dynamic backgrounds. Overall, the approach advances practical multi-subject T2V by enabling new events and motions with preserved identities, suggesting promising directions for scalable, fine-grained video customization.

Abstract

Customized text-to-video generation, which aims to generate text-guided videos with user-given subjects, has gained increasing attention. However, existing works are primarily limited to single-subject text-to-video generation, leaving the more challenging problem of customized multi-subject generation unexplored. In this paper, we fill this gap and propose a novel VideoDreamer framework, which can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects. Specifically, VideoDreamer adopts the pretrained Stable Diffusion with temporal modules as its base video generator, harnessing the power of the text-to-image model to generate diversified content. The video generator is further customized for multiple subjects by leveraging the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategies, which tackle the attribute binding problem of multi-subject generation. Additionally, we present a disentangled motion customization strategy to finetune the temporal modules so that we can generate videos with both customized subjects and motions. To evaluate the performance of customized multi-subject text-to-video generation, we introduce the MultiStudioBench benchmark. Extensive experiments demonstrate the remarkable ability of VideoDreamer to generate videos with new content, such as new events and backgrounds, tailored to the customized multiple subjects.

Paper Structure (31 sections, 10 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Customized multi-subject text-to-video generation results by VideoDreamer. Given multiple subjects and a few images of each subject, our VideoDreamer can generate videos that contain the given subjects, with new events, backgrounds, etc., guided by the text.
  • Figure 2: Visual comparison, where we use FastComposer and VideoDreamer to generate 2 images with 2 random seeds for the given prompt.
  • Figure 3: VideoDreamer: Given a pretrained video generator containing a text encoder $E_T$ and a U-Net with motion modules $\epsilon_{\theta,I,T}$, Disen-Mix Finetuning finetunes $E_T$ and the image modules $\epsilon_{\theta,I}$: separate-prompt finetuning customizes each subject independently, while disentangled finetuning on mixed data tackles the attribute binding problem. After finetuning, we obtain $E'_T$ and $\epsilon_{\theta,I'}$, which can be used to generate customized videos for multiple subjects. Additionally, we present a motion customization method, where we finetune the whole base text-to-video model on the reference video and use only the finetuned motion modules $\epsilon_{\theta,T_m}$ together with the image-finetuned $E'_T$ and $\epsilon_{\theta,I'}$ to obtain videos with both customized motion and customized subjects.
  • Figure 4: Video frames generated using only separate-prompt finetuning, shown for 2 different random seeds. With separate-prompt finetuning alone, the attributes of different subjects are mixed together, and sometimes one subject is missing.
  • Figure 5: Part of the MultiStudioBench dataset images.
  • ...and 8 more figures
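
The Figure 3 caption describes combining parameters from two finetuning runs: motion modules from the model finetuned on the reference video, and image modules from the Disen-Mix finetuned model (the finetuned text encoder $E'_T$ would also come from that second run). The following is only a minimal sketch of that selection step, not the authors' implementation. The parameter names, the `motion_modules.` prefix, and the use of plain Python lists as stand-ins for tensors are all assumptions made for illustration.

```python
from typing import Dict, List

# Stand-in for a checkpoint: parameter name -> values (real code would hold tensors).
Weights = Dict[str, List[float]]

# Hypothetical naming convention marking the temporal (motion) parameters.
MOTION_PREFIX = "motion_modules."


def combine_weights(subject_unet: Weights, motion_unet: Weights,
                    motion_prefix: str = MOTION_PREFIX) -> Weights:
    """Take motion-module parameters from `motion_unet` and all other
    (image-module) parameters from `subject_unet`."""
    if subject_unet.keys() != motion_unet.keys():
        raise ValueError("both checkpoints must share the same parameter names")
    combined: Weights = {}
    for name in subject_unet:
        source = motion_unet if name.startswith(motion_prefix) else subject_unet
        combined[name] = source[name]
    return combined


# Toy checkpoints: one from subject finetuning, one from motion finetuning.
subject_unet = {
    "image_modules.attn.weight": [1.0, 1.0],
    "motion_modules.temporal_attn.weight": [0.0, 0.0],
}
motion_unet = {
    "image_modules.attn.weight": [9.0, 9.0],
    "motion_modules.temporal_attn.weight": [5.0, 5.0],
}
combined = combine_weights(subject_unet, motion_unet)
```

In this toy example, `combined` keeps the image-module values from the subject-finetuned checkpoint and the motion-module values from the motion-finetuned one. This mirrors the caption's description of discarding the image modules learned during motion finetuning.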