Table of Contents
Fetching ...

MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization

Hengjia Li, Lifan Jiang, Xi Xiao, Tianyang Wang, Hongwei Yi, Boxi Wu, Deng Cai

TL;DR

MagicID tackles identity drift and reduced dynamics in personalized video generation from limited reference images by replacing self-reconstruction with direct preference optimization. It introduces a hybrid data strategy to create pairwise preference data, coupled with a two-stage sampling scheme that first prioritizes identity and then dynamics via Pareto-frontier based selection, and optimizes with a DPO-inspired objective extended to video via a hybrid loss $\mathcal{L}_{HPO}$. The approach yields superior identity fidelity and natural motion compared to state-of-the-art methods, validated through quantitative metrics, qualitative analysis, and user studies, and demonstrates practical potential for high-quality personalized video content. While effective, it currently focuses on single-identity scenarios, with plans to extend to multi-identity generation in future work.

Abstract

Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit significant dynamics based on users' reference images. However, existing approaches face two key challenges: identity degradation over extended video length and reduced dynamics during training, primarily due to their reliance on traditional self-reconstruction training with static images. To address these issues, we introduce $\textbf{MagicID}$, a novel framework designed to directly promote the generation of identity-consistent and dynamically rich videos tailored to user preferences. Specifically, we propose constructing pairwise preference video data with explicit identity and dynamic rewards for preference learning, instead of sticking to the traditional self-reconstruction. To address the constraints of customized preference data, we introduce a hybrid sampling strategy. This approach first prioritizes identity preservation by leveraging static videos derived from reference images, then enhances dynamic motion quality in the generated videos using a Frontier-based sampling method. By utilizing these hybrid preference pairs, we optimize the model to align with the reward differences between pairs of customized preferences. Extensive experiments show that MagicID successfully achieves consistent identity and natural dynamics, surpassing existing methods across various metrics.

MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization

TL;DR

MagicID tackles identity drift and reduced dynamics in personalized video generation from limited reference images by replacing self-reconstruction with direct preference optimization. It introduces a hybrid data strategy to create pairwise preference data, coupled with a two-stage sampling scheme that first prioritizes identity and then dynamics via Pareto-frontier based selection, and optimizes with a DPO-inspired objective extended to video via a hybrid loss . The approach yields superior identity fidelity and natural motion compared to state-of-the-art methods, validated through quantitative metrics, qualitative analysis, and user studies, and demonstrates practical potential for high-quality personalized video content. While effective, it currently focuses on single-identity scenarios, with plans to extend to multi-identity generation in future work.

Abstract

Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit significant dynamics based on users' reference images. However, existing approaches face two key challenges: identity degradation over extended video length and reduced dynamics during training, primarily due to their reliance on traditional self-reconstruction training with static images. To address these issues, we introduce , a novel framework designed to directly promote the generation of identity-consistent and dynamically rich videos tailored to user preferences. Specifically, we propose constructing pairwise preference video data with explicit identity and dynamic rewards for preference learning, instead of sticking to the traditional self-reconstruction. To address the constraints of customized preference data, we introduce a hybrid sampling strategy. This approach first prioritizes identity preservation by leveraging static videos derived from reference images, then enhances dynamic motion quality in the generated videos using a Frontier-based sampling method. By utilizing these hybrid preference pairs, we optimize the model to align with the reward differences between pairs of customized preferences. Extensive experiments show that MagicID successfully achieves consistent identity and natural dynamics, surpassing existing methods across various metrics.

Paper Structure

This paper contains 20 sections, 27 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Results of MagicID. Given a few reference images, our method is capable of generating highly realistic and personalized videos that maintain consistent identity features while exhibiting natural and visually appealing motion dynamics.
  • Figure 2: Analysis of identity degratation and dynamic reduction. (a) We compute the mean identity similarity with the reference images for generated videos of different lengths. As shown, traditional approaches suffer from diminished identity consistency as video length increases. In contrast, our method maintains strong identity robustness throughout prolonged video generations. (b) We calculate the dynamic degree for different training steps. As the customization progresses, traditional methods experience a gradual loss of motion dynamic during customization, whereas our method preserves original video dynamics across the entire training.
  • Figure 3: Overview of pairwise preference video data construction. In Step 1, we construct a preference video repository using videos generated by fine-tuned and Initial T2V models, along with static videos derived from reference images. In Step 2, we evaluate each video sequentially based on ID consistency using ID Encoder deng2019arcface, dynamic degree using optical flow huang2024vbench, and prompt following using VLMbai2023qwenvlversatilevisionlanguagemodel. In Step 3, we perform Hybrid Pair Selection, first selecting pairs based on ID consistency differences with a pre-defined dynamic threshold to address identity inconsistency, then selecting pairs based on both dynamic and identity to mitigate the dynamic reduction.
  • Figure 4: Qualitative comparison with tuning-based methods. As observed, both Dreambooth and MagicMe suffer from inferior ID fidelity, while our method maintains consistent identity and natural dynamics.
  • Figure 5: Qualitative comparison with tuning-based methods. As shown, ID-Animator suffers from poor identity consistency and video quality. While ConsisID improves identity fidelity to some extent, it exhibits severe copy-paste artifacts, demonstrating unnatural motion dynamics and text alignment, as seen in the last example with the helmet. In contrast, our method achieves strong performance in identity consistency, motion dynamics, and text alignment, significantly outperforming the baseline approaches.
  • ...and 8 more figures