Subject-driven Video Generation via Disentangled Identity and Motion
Daneul Kim, Jingxu Zhang, Wonjoon Jin, Sunghyun Cho, Qi Dai, Jaesik Park, Chong Luo
TL;DR
This work tackles efficient zero-shot subject-driven video generation by disentangling identity learning from temporal dynamics. It introduces a dual-task training regime that injects subject identity from paired subject images while preserving motion priors with a small set of unpaired videos, switching stochastically between the two tasks with $p=0.2$. The approach uses a CogVideoX-5B backbone with LoRA adapters and a simple two-task objective, which avoids large-scale paired subject-video data and achieves strong subject fidelity and temporal coherence in about $288$ A100-hours. Gradient analysis shows that the identity and motion objectives converge to near-orthogonal update directions, which explains why training is stable without gradient surgery and supports competitive zero-shot performance against heavily tuned baselines. In practice, the method reduces the data and compute requirements of subject-driven video generation while maintaining high-quality, identity-consistent video generation for unseen subjects.
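The stochastic task-switching idea can be illustrated with a minimal sketch. The summary does not say which task $p=0.2$ applies to, so the code below assumes it is the probability of drawing a video (motion-preservation) step, with the remaining steps used for image-based identity injection. The names `pick_task`, `run_schedule`, and `p_video` are hypothetical and are not taken from the paper.

```python
import random


def pick_task(rng, p_video=0.2):
    """Choose the task for one training step.

    Assumption: p_video is the probability of a video (motion) step;
    the summary only states that switching uses p = 0.2.
    """
    return "video" if rng.random() < p_video else "image"


def run_schedule(num_steps, p_video=0.2, seed=0):
    """Count how often each task is drawn over num_steps steps."""
    rng = random.Random(seed)
    counts = {"image": 0, "video": 0}
    for _ in range(num_steps):
        # In a real training loop, the chosen task would decide which
        # batch to load and which loss to backpropagate on this step.
        counts[pick_task(rng, p_video)] += 1
    return counts


counts = run_schedule(10_000)
video_fraction = counts["video"] / 10_000
```

Because each step samples a single task, only one objective contributes gradients per step; over many steps the fraction of video steps approaches `p_video`.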
Abstract
We propose to train a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics, so that the model works zero-shot without additional tuning. Traditional tuning-free methods for video customization often rely on large annotated video datasets, which are computationally expensive and require extensive annotation. In contrast, we train video customization models directly on an image customization dataset, factorizing video customization into two parts: (1) identity injection through the image customization dataset and (2) preservation of temporal modeling with a small set of unannotated videos through image-to-video training. Additionally, we employ random image token dropping with randomized image initialization during image-to-video fine-tuning to mitigate the copy-and-paste issue. To further enhance learning, we introduce stochastic switching during joint optimization of subject-specific and temporal features, which mitigates catastrophic forgetting. Our method achieves strong subject consistency and scalability and outperforms existing video customization models in zero-shot settings, demonstrating the effectiveness of our framework.
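As a rough illustration of random image token dropping, the sketch below removes each conditioning-image token independently with a fixed probability, so the model cannot simply copy the full reference frame. The abstract does not specify the dropping scheme or rate, so the independent per-token rule, the function name `drop_image_tokens`, and the `drop_prob` value are all assumptions; the randomized image initialization is not modeled here.

```python
import random


def drop_image_tokens(tokens, drop_prob, rng):
    """Return the tokens that survive independent random dropping.

    Assumption: each token is dropped independently with probability
    drop_prob; the paper's actual scheme may differ.
    """
    return [tok for tok in tokens if rng.random() >= drop_prob]


rng = random.Random(0)
image_tokens = list(range(16))  # stand-in for patch-token indices
kept = drop_image_tokens(image_tokens, drop_prob=0.5, rng=rng)
```

The surviving tokens keep their original order, so positional information for the remaining patches is preserved in this sketch.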
