VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models
Hong Chen, Xin Wang, Guanning Zeng, Yipeng Zhang, Yuwei Zhou, Feilin Han, Yaofei Wu, Wenwu Zhu
TL;DR
This work tackles customized multi-subject text-to-video generation, in which multiple user-defined subjects must be preserved while the video follows a textual prompt. It introduces VideoDreamer, which uses Stable Diffusion with temporal modules as its base video generator and employs Disen-Mix Finetuning to address attribute binding across subjects, complemented by optional Human-in-the-Loop Re-finetuning and a disentangled motion customization strategy. The authors present MultiStudioBench, a dedicated benchmark with 25 subjects and multiple subject combinations, to evaluate subject fidelity, prompt fidelity, temporal consistency, and artifacts, and report that VideoDreamer outperforms baselines and produces fewer stitching artifacts. Ablation studies show the value of disentangled embeddings, the weak denoising objective, and the combination of separate-data and mixed-data finetuning, while also revealing limitations in higher-order multi-subject scenarios and with dynamic backgrounds. Overall, the approach advances practical multi-subject text-to-video generation by enabling new events and motions while preserving subject identities, and it suggests promising directions for scalable, fine-grained video customization.
Abstract
Customized text-to-video generation, which aims to generate text-guided videos featuring user-given subjects, has gained increasing attention. However, existing works are primarily limited to single-subject text-to-video generation, leaving the more challenging problem of customized multi-subject generation unexplored. In this paper, we fill this gap and propose a novel framework, VideoDreamer, which can generate temporally consistent text-guided videos that faithfully preserve the visual features of multiple given subjects. Specifically, VideoDreamer adopts the pretrained Stable Diffusion with temporal modules as its base video generator, leveraging the power of the text-to-image model to generate diversified content. The video generator is further customized for multiple subjects using the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategies, which tackle the attribute binding problem of multi-subject generation. Additionally, we present a disentangled motion customization strategy that finetunes the temporal modules, so that we can generate videos with both customized subjects and customized motions. To evaluate the performance of customized multi-subject text-to-video generation, we introduce the MultiStudioBench benchmark. Extensive experiments demonstrate the remarkable ability of VideoDreamer to generate videos with new content, such as new events and backgrounds, tailored to the customized multiple subjects.
