MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers

Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, Jiaya Jia

TL;DR

MagicMirror introduces a zero-shot ID-preserving video generation framework built on Video Diffusion Transformers. It leverages a dual-branch facial feature extractor and a lightweight Conditioned Adaptive Normalization (CAN) adapter to inject identity cues into a DiT backbone, paired with a two-stage training regime using synthetic identity data and video data. The approach achieves strong identity consistency and natural facial motion, outperforming state-of-the-art ID-preserving and I2V methods while adding minimal parameter overhead. This work advances personalized video synthesis by enabling identity-maintained dynamic storytelling without per-identity fine-tuning.

Abstract

We present MagicMirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that MagicMirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while adding only a small number of parameters. The code and model will be made publicly available.
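
To make the idea of conditioning a normalization layer on an identity signal more concrete, the following is a minimal sketch of generic adaptive normalization: features are normalized, then scaled and shifted by values predicted from a condition vector. It is not the paper's Conditioned Adaptive Normalization, whose exact formulation is not given in this summary. The function names, the toy sizes, and the `1 + scale` modulation form are assumptions made for illustration only.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(w, b, c):
    """Apply y = W c + b, with W given as a list of rows."""
    return [sum(wi * ci for wi, ci in zip(row, c)) + bi for row, bi in zip(w, b)]

def conditioned_norm(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Normalize x, then modulate it with a scale and shift predicted from cond.

    Assumption: the modulation is n * (1 + scale) + shift, a common
    adaptive-normalization form; the paper's variant may differ.
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    return [n * (1.0 + s) + t for n, s, t in zip(layer_norm(x), scale, shift)]

# Hypothetical toy sizes: 4 feature channels, 2-dimensional identity embedding.
x = [1.0, 2.0, 3.0, 4.0]
id_embedding = [0.5, -0.5]
zeros_w = [[0.0, 0.0] for _ in range(4)]
zeros_b = [0.0] * 4

# With zero-initialized projections the condition has no effect, so the
# output matches plain layer normalization.
out = conditioned_norm(x, id_embedding, zeros_w, zeros_b, zeros_w, zeros_b)

# A constant shift bias moves every normalized channel by the same amount.
shifted = conditioned_norm(x, id_embedding, zeros_w, zeros_b, zeros_w, [1.0] * 4)
```

Zero-initializing the projections is one common way to let an adapter start as an identity mapping over a pretrained backbone, so that conditioning is introduced gradually during training; whether MagicMirror does this is not stated here.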

Paper Structure (51 sections, 8 equations, 20 figures, 6 tables)

Figures (20)

  • Figure 1: MagicMirror generates text-to-video results given the ID reference image. Complete videos are available at https://julianjuaner.github.io/projects/MagicMirror/.
  • Figure 2: MagicMirror generates dynamic facial motion. ID-Animator [he2024id] and Video Ocean [luchen2024ocean] exhibit limited motion range due to a strong identity-preservation constraint. MagicMirror achieves more dynamic facial expressions while maintaining reference identity fidelity.
  • Figure 3: Overview of MagicMirror. The framework employs a dual-branch feature extraction system with ID and face perceivers, followed by a cross-modal adapter (illustrated in Figure 4) for DiT-based video generation. By optimizing trainable modules marked by the flame, our method efficiently integrates facial features for controlled video synthesis while maintaining model efficiency.
  • Figure 4: Cross-modal adapter in DiT blocks. Top: Cross-modal modulation in mmDiTs. Bottom: The Conditioned Adaptive Normalization (CAN) for modal-specific feature modulation and decoupled attention integration.
  • Figure 5: Overview of our training datasets. The pipeline includes image pre-training data (A-C) and video post-training data (D). We utilize both self-reference data (A, B) and filtered synthesized pairs with the same identity (C, D). Numbers of (images + synthesized images) are reported.
  • ...and 15 more figures