Subject-driven Video Generation via Disentangled Identity and Motion

Daneul Kim, Jingxu Zhang, Wonjoon Jin, Sunghyun Cho, Qi Dai, Jaesik Park, Chong Luo

TL;DR

This work tackles efficient zero-shot subject-driven video generation (SDV-Gen) by disentangling identity learning from temporal dynamics. It introduces a dual-task training regime that injects subject identity from paired subject images while preserving motion priors from a small set of unpaired videos, using stochastic task switching with $p=0.2$. The approach relies on a CogVideoX-5B backbone with LoRA adapters and a simple two-task objective, avoiding large-scale paired subject-video data and achieving strong subject fidelity and temporal coherence within about $288$ A100-hours. Gradient analysis shows that the identity and motion objectives converge to near-orthogonal update directions, which explains the training stability without gradient surgery and enables zero-shot performance competitive with heavily tuned baselines. Practically, the method reduces the data and compute requirements of SDV-Gen while maintaining high-quality, identity-consistent video generation across unseen subjects.

Abstract

We propose training a zero-shot subject-driven video customization model, one that needs no additional per-subject tuning, by decoupling subject-specific learning from temporal dynamics. Existing tuning-free video customization methods often rely on large annotated video datasets, which are computationally expensive to train on and require extensive annotation. In contrast, we use an image customization dataset directly to train a video customization model, factorizing video customization into two parts: (1) identity injection through the image customization dataset and (2) preservation of temporal modeling with a small set of unannotated videos through image-to-video training. Additionally, we employ random image token dropping with randomized image initialization during image-to-video fine-tuning to mitigate the copy-and-paste issue. To further enhance learning, we introduce stochastic switching during joint optimization of subject-specific and temporal features, which mitigates catastrophic forgetting. Our method achieves strong subject consistency and scalability and outperforms existing video customization models in zero-shot settings, demonstrating the effectiveness of our framework.
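The stochastic switching described above can be illustrated with a minimal sketch. It assumes that $p=0.2$ (the value quoted in the TL;DR) is the probability of drawing the video (motion-preservation) task at a given step; the source does not state which task the probability applies to, so `p_video` and the function names `sample_task` and `task_schedule` are hypothetical.

```python
import random


def sample_task(rng, p_video=0.2):
    """Pick the objective for one training step.

    Assumption: p_video is the chance of a motion-preservation (video) step;
    otherwise the step optimizes the identity-injection (image) objective.
    """
    return "video" if rng.random() < p_video else "image"


def task_schedule(num_steps, p_video=0.2, seed=0):
    """Draw an independent task choice for every training step."""
    rng = random.Random(seed)
    return [sample_task(rng, p_video) for _ in range(num_steps)]


schedule = task_schedule(10_000)
video_fraction = schedule.count("video") / len(schedule)
print(f"fraction of video steps: {video_fraction:.3f}")
```

In a real training loop, each drawn task would select the batch source and loss for that step, so a single set of adapter weights sees both objectives interleaved rather than in separate phases.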

Paper Structure

This paper contains 72 sections, 4 theorems, 52 equations, 16 figures, 12 tables, 1 algorithm.

Key Result

Proposition 4.1

Consider two loss functions $\mathcal{L}_1$ and $\mathcal{L}_2$ that are $L$-smooth and locally convex in a neighborhood of convergence, and the stochastic update rule with a sufficiently small step size $\eta$. If the initial gradients exhibit non-trivial conflict, i.e., $\langle \nabla \mathcal{L}_1(\theta_0), \nabla \mathcal{L}_2(\theta_0)\rangle \neq 0$, then under the above assumptions the e

Figures (16)

  • Figure 1: Dual-task learning strategy. We formulate subject-driven video generation as a dual-task problem. The first task is identity injection (bottom) from paired subject images; the second is motion-awareness preservation (left), for which we use unpaired videos. The two tasks are trained with stochastic switching.
  • Figure 2: Training and inference details. Left: During training, we stochastically alternate between two objectives: identity injection using paired subject images and motion-awareness preservation using a small set of unpaired videos. Right: At inference time, no additional per-subject tuning is required. The model generates a video conditioned on the reference image and text prompt in a zero-shot manner.
  • Figure 3: Limitation of the SDI-Gen$\rightarrow$I2V method. When the subject appears small in the first frame, I2V fails to generate consistent results because it cannot interpret low-resolution subjects.
  • Figure 4: Gradient analysis of alignment and norms during fine-tuning. Left: Cosine similarity $\phi(t)$ between $g_{\text{img}}$ and $g_{\text{vid}}$ (over trainable parameters) quickly collapses to a narrow band near zero under dual-task training, indicating emergent near-orthogonality. Right: $\ell_2$ norms $\|g_{\text{img}}(t)\|_2$ and $\|g_{\text{vid}}(t)\|_2$ remain non-negligible and similar in scale after 100 steps.
  • Figure 5: Qualitative comparison with zero-shot methods (left) and per-subject tuning methods (right). "Ours mini" denotes our model fine-tuned on a 4,000-sample subset of Subject-200K (OminiControl). Note that our method is zero-shot and tuning-free, requiring no tuning at inference time.
  • ...and 11 more figures
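
The quantities plotted in Figure 4 can be illustrated with a small sketch: the cosine similarity $\phi$ between two gradient vectors and their $\ell_2$ norms. The vectors below are toy values chosen to be exactly orthogonal, not gradients from the paper, and the helper names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_norm(u):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(u, u))


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between u and v; eps guards against zero norms."""
    return dot(u, v) / (l2_norm(u) * l2_norm(v) + eps)


# Toy stand-ins for flattened gradients over the trainable parameters.
g_img = [3.0, 0.0, 4.0]
g_vid = [0.0, 5.0, 0.0]

phi = cosine_similarity(g_img, g_vid)
print(phi, l2_norm(g_img), l2_norm(g_vid))
```

A similarity near zero combined with norms of similar scale is the pattern the caption describes: neither objective's gradient vanishes, yet the two directions barely interact.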

Theorems & Definitions (7)

  • Proposition 4.1: Convergence of gradient inner product
  • Proposition C.3.1: Local model behind Proposition 4.1
  • proof
  • Proposition C.4.1: Image-only fine-tuning
  • proof
  • Lemma C.5.1: Difference between PCGrad and mixture update
  • proof
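
The statement of Lemma C.5.1 is not reproduced on this page, so the following is only a sketch of the two update rules it names, under stated assumptions. `pcgrad_project` follows the standard PCGrad projection (remove the conflicting component when the inner product is negative); `expected_mixture_update` assumes the mixture update means choosing the video gradient with probability `p_video` and the image gradient otherwise, then taking the expectation. Both function names and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pcgrad_project(g_i, g_j):
    """Standard PCGrad step: if g_i conflicts with g_j (negative inner
    product), subtract the component of g_i along g_j; otherwise keep g_i."""
    inner = dot(g_i, g_j)
    if inner >= 0:
        return list(g_i)
    scale = inner / dot(g_j, g_j)
    return [a - scale * b for a, b in zip(g_i, g_j)]


def expected_mixture_update(g_img, g_vid, p_video=0.2):
    """Expected direction when one task gradient is sampled per step
    (assumption: the video task is drawn with probability p_video)."""
    return [(1 - p_video) * a + p_video * b for a, b in zip(g_img, g_vid)]


g_img = [1.0, 0.0]
g_orthogonal = [0.0, 1.0]
g_conflicting = [-1.0, 1.0]

unchanged = pcgrad_project(g_img, g_orthogonal)
projected = pcgrad_project(g_img, g_conflicting)
mixture = expected_mixture_update(g_img, g_orthogonal)
print(unchanged, projected, mixture)
```

When the two gradients are orthogonal, the projection leaves them untouched, which is consistent with the TL;DR's remark that near-orthogonal updates make gradient surgery unnecessary.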