MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
Hengjia Li, Lifan Jiang, Xi Xiao, Tianyang Wang, Hongwei Yi, Boxi Wu, Deng Cai
TL;DR
MagicID tackles identity drift and reduced dynamics in personalized video generation from limited reference images by replacing self-reconstruction with direct preference optimization. It builds pairwise preference data with a hybrid, two-stage sampling scheme that first prioritizes identity and then dynamics via Pareto-frontier-based selection, and it optimizes the model with a DPO-inspired objective extended to video, the hybrid loss $\mathcal{L}_{HPO}$. The approach yields better identity fidelity and more natural motion than state-of-the-art methods, as validated through quantitative metrics, qualitative analysis, and user studies, and it shows practical potential for high-quality personalized video content. It currently focuses on single-identity scenarios; extending it to multi-identity generation is left to future work.
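For orientation, the standard DPO objective that "DPO-inspired" refers to can be written as below. This is the generic form, not the paper's $\mathcal{L}_{HPO}$, whose exact definition is not given in this summary. The symbols are the usual ones and are assumptions with respect to this document: $x$ is the conditioning input, $y_w$ and $y_l$ are the preferred and rejected samples, $\pi_\theta$ is the model being tuned, $\pi_{ref}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ is a temperature.

$$\mathcal{L}_{DPO} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right]$$

Minimizing this loss raises the model's likelihood ratio on the preferred sample relative to the rejected one, which is why the method needs preference pairs rather than reconstruction targets.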
Abstract
Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit significant dynamics based on users' reference images. However, existing approaches face two key challenges: identity degradation over extended video length and reduced dynamics during training, primarily due to their reliance on traditional self-reconstruction training with static images. To address these issues, we introduce $\textbf{MagicID}$, a novel framework designed to directly promote the generation of identity-consistent and dynamically rich videos tailored to user preferences. Specifically, we propose constructing pairwise preference video data with explicit identity and dynamic rewards for preference learning, rather than relying on traditional self-reconstruction. To address the constraints of customized preference data, we introduce a hybrid sampling strategy. This approach first prioritizes identity preservation by leveraging static videos derived from reference images, then enhances dynamic motion quality in the generated videos using a Frontier-based sampling method. Using these hybrid preference pairs, we optimize the model to align with the reward differences between pairs of customized preferences. Extensive experiments show that MagicID achieves consistent identity and natural dynamics, surpassing existing methods across various metrics.
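The frontier-based selection of preference pairs can be illustrated with a minimal sketch. It assumes each candidate video already has two scalar rewards, identity and dynamics, where higher is better. The function names, the reward values, and the pairing rule (a frontier sample is preferred over every sample it dominates) are hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# (identity_reward, dynamic_reward); both assumed higher-is-better.
Reward = Tuple[float, float]


def dominates(a: Reward, b: Reward) -> bool:
    """True if a is at least as good as b on both rewards and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(rewards: List[Reward]) -> List[int]:
    """Return indices of samples that no other sample dominates."""
    return [
        i
        for i, r_i in enumerate(rewards)
        if not any(dominates(r_j, r_i) for j, r_j in enumerate(rewards) if j != i)
    ]


def build_preference_pairs(rewards: List[Reward]) -> List[Tuple[int, int]]:
    """Pair each frontier sample (preferred) with every sample it dominates (rejected)."""
    pairs = []
    for w in pareto_front(rewards):
        for l, r_l in enumerate(rewards):
            if l != w and dominates(rewards[w], r_l):
                pairs.append((w, l))
    return pairs


# Hypothetical rewards for four candidate videos.
rewards = [(0.9, 0.2), (0.7, 0.8), (0.6, 0.5), (0.3, 0.1)]
front = pareto_front(rewards)
pairs = build_preference_pairs(rewards)
print(front)  # [0, 1]
print(pairs)  # [(0, 3), (1, 2), (1, 3)]
```

In this toy example, videos 0 and 1 trade off identity against dynamics and neither dominates the other, so both lie on the frontier. Each resulting `(preferred, rejected)` index pair could then feed a preference-optimization loss.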
