MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers
Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, Jiaya Jia
TL;DR
MagicMirror introduces a zero-shot ID-preserving video generation framework built on Video Diffusion Transformers (DiT). It combines a dual-branch facial feature extractor with a lightweight Conditioned Adaptive Normalization (CAN) adapter that injects identity cues into the DiT backbone, and it is trained in two stages on synthetic identity data and video data. The approach achieves strong identity consistency and natural facial motion, outperforming state-of-the-art ID-preserving and image-to-video (I2V) methods while adding minimal parameter overhead. This work advances personalized video synthesis by enabling identity-preserving dynamic storytelling without per-identity fine-tuning.
Abstract
We present MagicMirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that MagicMirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while adding only a small number of parameters. The code and model will be made publicly available.
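To make the idea of conditioning normalization on identity more concrete, the following is a minimal sketch of a generic adaptive-normalization layer, in which a conditioning vector predicts a per-channel scale and shift that modulate normalized tokens. It is not the paper's CAN module: the function names (`layer_norm`, `linear`, `adaptive_norm`), the toy dimensions, and the zero-initialized projection are all assumptions made for illustration, and plain Python is used so the example runs without dependencies.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one token vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def linear(vec, weight, bias):
    """Compute vec @ weight + bias; weight has len(vec) rows and len(bias) columns."""
    return [
        sum(v * row[j] for v, row in zip(vec, weight)) + bias[j]
        for j in range(len(bias))
    ]

def adaptive_norm(tokens, id_embedding, weight, bias):
    """Modulate normalized tokens with a scale and shift predicted from id_embedding.

    Hypothetical layer: the projection outputs 2 * hidden_dim values, where the
    first half is the per-channel scale and the second half the per-channel shift.
    """
    hidden_dim = len(tokens[0])
    modulation = linear(id_embedding, weight, bias)
    scale, shift = modulation[:hidden_dim], modulation[hidden_dim:]
    return [
        [n * (1.0 + s) + b for n, s, b in zip(layer_norm(tok), scale, shift)]
        for tok in tokens
    ]

# Toy sizes (assumptions): 2 tokens, hidden width 4, identity embedding width 3.
hidden_dim, cond_dim = 4, 3
tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, -1.5]]
id_embedding = [0.2, -0.1, 0.7]

# Zero-initialized projection (an assumption): the layer starts as plain
# normalization, so the identity condition has no effect before training.
zero_weight = [[0.0] * (2 * hidden_dim) for _ in range(cond_dim)]
zero_bias = [0.0] * (2 * hidden_dim)
baseline = adaptive_norm(tokens, id_embedding, zero_weight, zero_bias)

# A non-zero projection makes the output depend on the identity embedding.
shift_weight = [[0.0] * hidden_dim + [1.0] * hidden_dim for _ in range(cond_dim)]
conditioned = adaptive_norm(tokens, id_embedding, shift_weight, zero_bias)
```

With a zero-initialized projection the layer reduces to ordinary normalization, which is one common way to attach a lightweight adapter to a pretrained backbone without disturbing its behavior at the start of training; whether MagicMirror initializes its adapter this way is not stated in the abstract.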
