DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

Hong Chen; Xin Wang; Yipeng Zhang; Yuwei Zhou; Zeyang Zhang; Siao Tang; Wenwu Zhu

TL;DR

DisenStudio tackles customized multi-subject text-to-video generation from few-shot subject images. It introduces a spatial-disentangled cross-attention mechanism that binds each action to the correct subject, and a motion-preserved disentangled finetuning strategy that maintains both appearance fidelity and temporal dynamics. The finetuning combines multi-subject co-occurrence tuning on synthesized data, masked single-subject tuning, and multi-subject motion-preserved tuning, targeting subject fidelity, textual alignment, and temporal consistency. Experiments on the proposed DisenStudioBench show significant improvements over VideoDreamer and DreamBooth/CustomDiffusion baselines in both objective metrics and human judgments, and ablations confirm the contribution of each component. The approach enables controllable multi-subject video generation and can serve as a basis for broader controllable video synthesis tasks.

Abstract

Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for a single subject, and suffer from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (the action-binding problem), and thus fail to achieve satisfactory multi-subject generation performance. To tackle these problems, we propose DisenStudio, a novel framework that generates text-guided videos for multiple customized subjects, given a few images of each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. The model is then customized for the multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee that the subjects appear and that their visual attributes are preserved, and the third helps the model retain its temporal motion-generation ability when finetuned on static images. We conduct extensive experiments to demonstrate that DisenStudio significantly outperforms existing methods on various metrics. Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications.
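
The abstract names the spatial-disentangled cross-attention but does not give its formulation here. As a rough illustration only, one way to bind each subject to its own action is to restrict which text tokens every spatial position may attend to, so that a region assigned to one subject never sees another subject's action words. The sketch below implements that generic masked cross-attention in NumPy. The function name `masked_cross_attention`, the `allowed` mask layout, and the toy sizes are assumptions made for this example, not the paper's implementation.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, allowed):
    """Cross-attention in which each spatial position sees only some text tokens.

    queries: (P, d) array, one row per spatial position.
    keys, values: (T, d) arrays, one row per text token.
    allowed: (P, T) boolean array, True where position p may attend to token t.
    Returns the attended features (P, d) and the attention weights (P, T).
    """
    if not allowed.any(axis=1).all():
        raise ValueError("every position needs at least one allowed token")
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Disallowed pairs get -inf so they receive exactly zero weight.
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Toy setup (hypothetical): 4 positions, 3 tokens.
# Token 0 is shared context; token 1 belongs to subject A, token 2 to subject B.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(3, 8))
v = rng.normal(size=(3, 8))
allowed = np.array([
    [True, True, False],   # positions 0-1: region of subject A
    [True, True, False],
    [True, False, True],   # positions 2-3: region of subject B
    [True, False, True],
])
out, w = masked_cross_attention(q, k, v, allowed)
```

With this mask, positions in subject A's region place zero weight on subject B's token and vice versa, which is the binding behavior the mechanism is meant to achieve.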

Paper Structure (37 sections, 9 equations, 13 figures, 3 tables)

This paper contains 37 sections, 9 equations, 13 figures, 3 tables.

Introduction
Related Work
Text-to-image diffusion models
Text-to-video generation
Text-guided video editing
Subject customization
Methodology
Preliminaries
Stable Diffusion
AnimateDiff
A Naive Approach
DisenStudio
Spatial-disentangled cross-attention
Motion-preserved Disentangled Finetuning
Joint optimization
...and 22 more sections

Figures (13)

Figure 1: Illustration of customized multi-subject text-to-video generation.
Figure 2: The proposed DisenStudio framework is built on the AnimateDiff model, which comprises a text encoder and a U-Net with temporal modules. Given a few images of each subject, (A) we synthesize multi-subject co-occurrence data from randomly generated backgrounds and segmented subjects; (B) we generate images in which different subjects each perform a randomly sampled action, which are used to maintain the model's motion-generation ability; (C) we finetune the U-Net and text encoder with LoRA on the synthesized co-occurrence data and the generated motion-prior data; (D) we insert the temporal modules into the U-Net and generate videos with the spatial-disentangled cross-attention.
Figure 3: Video frames generated by the naive approach.
Figure 4: Comparison between the spatial-disentangled cross-attention and the vanilla cross-attention.
Figure 5: Images generated by pretrained Stable Diffusion with the spatial-disentangled cross-attention (SDCA) have a continuous background.
...and 8 more figures
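
Step (A) of the Figure 2 caption, synthesizing co-occurrence data from segmented subjects and a background, can be pictured as simple mask-based compositing. The sketch below is a minimal stand-in under stated assumptions: the background is plain random noise rather than a generated image, each subject is placed in its own vertical strip so subjects never overlap, and the helper name `composite_subjects` is hypothetical. The paper's actual synthesis procedure may differ.

```python
import numpy as np

def composite_subjects(background, subjects, rng):
    """Paste segmented subjects onto a background, one per vertical strip.

    background: (H, W, 3) uint8 image.
    subjects: list of (rgb, mask) pairs, rgb is (h, w, 3) uint8 and
              mask is (h, w) bool marking the subject's pixels.
    Returns the composite image and a list of (top, left, h, w) boxes.
    """
    canvas = background.copy()
    H, W, _ = canvas.shape
    strip = W // len(subjects)  # assumption: equal-width, non-overlapping strips
    boxes = []
    for i, (rgb, mask) in enumerate(subjects):
        h, w = mask.shape
        if h > H or w > strip:
            raise ValueError("subject does not fit into its strip")
        top = int(rng.integers(0, H - h + 1))
        left = i * strip + int(rng.integers(0, strip - w + 1))
        region = canvas[top:top + h, left:left + w]  # view into canvas
        region[mask] = rgb[mask]                     # copy only subject pixels
        boxes.append((top, left, h, w))
    return canvas, boxes

# Toy data (hypothetical): noise background and two flat-colored "subjects".
rng = np.random.default_rng(0)
background = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
mask = np.ones((16, 16), dtype=bool)
mask[:4, :4] = False  # pretend the segmentation cut away one corner
red = np.zeros((16, 16, 3), dtype=np.uint8)
red[..., 0] = 255
blue = np.zeros((16, 16, 3), dtype=np.uint8)
blue[..., 2] = 255
image, boxes = composite_subjects(background, [(red, mask), (blue, mask)], rng)
```

The returned boxes record where each subject landed, which is the kind of layout information a spatially controlled attention mechanism could later consume.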
