PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

Hengjia Li, Haonan Qiu, Shiwei Zhang, Xiang Wang, Yujie Wei, Zekun Li, Yingya Zhang, Boxi Wu, Deng Cai

TL;DR

PersonalVideo tackles high-fidelity, identity-specific video customization from minimal reference data by addressing the tuning-inference gap in text-to-video (T2V) generation. It replaces reconstruction-based supervision with a non-reconstructive reward framework comprising an Identity Consistency Reward and a Semantic Consistency Reward, plus simulated prompt augmentation and an Isolated Identity Adapter to preserve dynamics. The approach directly optimizes generated videos, aligning their identity with the reference while maintaining the original T2V model's motion and semantic distribution. Experiments on diffusion-based backbones show higher identity fidelity than prior methods with preserved dynamics, robustness to single-image references, and compatibility with LoRAs, indicating practical potential for scalable, flexible video personalization.

Abstract

Text-to-video (T2V) generation has made significant progress in synthesizing realistic general videos, but identity-specific human video generation with customized ID images remains under-explored. The key challenge lies in consistently maintaining high ID fidelity while preserving the original motion dynamics and semantic following after the identity injection. Current video identity customization methods mainly rely on reconstructing the given identity images on text-to-image models, whose distribution diverges from that of the T2V model. This process introduces a tuning-inference gap, leading to dynamic and semantic degradation. To tackle this problem, we propose a novel framework, dubbed PersonalVideo, that applies a mixture of reward supervision on synthesized videos instead of the simple reconstruction objective on images. Specifically, we first incorporate an identity consistency reward to effectively inject the reference's identity without the tuning-inference gap. We then propose a novel semantic consistency reward to align the semantic distribution of the generated videos with that of the original T2V model, which preserves its dynamic and semantic following capability during the identity injection. With the non-reconstructive reward training, we further employ simulated prompt augmentation to reduce overfitting by supervising generated results in more semantic scenarios, gaining good robustness even with only a single reference image. Extensive experiments demonstrate our method's superiority in delivering high identity faithfulness while preserving the inherent video generation qualities of the original T2V model, outperforming prior methods.
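The mixture of reward supervision described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that both rewards are cosine similarities between precomputed embedding vectors: per-frame face embeddings against the reference for identity, and per-frame semantic embeddings of the tuned model's video against those of the original T2V model's video. The function names, the cosine form, and the weights `w_id` and `w_sem` are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def identity_loss(frame_face_embs, ref_face_emb):
    """1 - mean similarity between each generated frame's face embedding
    and the reference embedding (assumed form, not the paper's definition)."""
    sims = [cosine(f, ref_face_emb) for f in frame_face_embs]
    return 1.0 - sum(sims) / len(sims)

def semantic_loss(tuned_sem_embs, orig_sem_embs):
    """1 - mean frame-by-frame similarity between the tuned model's video
    and the original T2V model's video (assumed form)."""
    sims = [cosine(t, o) for t, o in zip(tuned_sem_embs, orig_sem_embs)]
    return 1.0 - sum(sims) / len(sims)

def total_loss(frame_face_embs, ref_face_emb, tuned_sem_embs, orig_sem_embs,
               w_id=1.0, w_sem=1.0):
    """Weighted sum of the two terms; w_id and w_sem are hypothetical weights."""
    return (w_id * identity_loss(frame_face_embs, ref_face_emb)
            + w_sem * semantic_loss(tuned_sem_embs, orig_sem_embs))
```

In this sketch, lowering the first term pulls the generated face toward the reference, while the second term penalizes drift of the generated video away from what the original T2V model would produce for the same prompt.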

Paper Structure

This paper contains 22 sections, 6 equations, 24 figures, 4 tables.

Figures (24)

  • Figure 1: Results of PersonalVideo. Given the reference images of a specific identity, PersonalVideo can generate high ID-fidelity videos with promising motion dynamics and prompt following.
  • Figure 2: Analysis of the tuning-inference gap. Previous T2V customization supervises the tuning process by reconstructing images on T2I models, and therefore suffers from a tuning-inference gap. In contrast, we apply the supervision directly on generated videos, which aligns with inference and bridges the gap.
  • Figure 3: Overview of the PersonalVideo framework. To bridge the tuning-inference gap, we directly apply reward supervision on videos generated from pure noise, including an identity consistency reward with the reference and a semantic consistency reward with the original video. During the optimization, we adopt simulated prompts sampled from a large language model to supervise generated results in more semantic scenarios.
  • Figure 4: Visualization of the video denoising steps. The motion of the person, e.g., his hand, is formed in the early stages of the denoising process. The later steps focus on recovering the detailed appearance.
  • Figure 5: Qualitative comparison for a few references. As observed, both Dreambooth and MagicMe suffer from inferior ID fidelity. In contrast, our PersonalVideo maintains high ID fidelity and preserves the original motion dynamics and semantic following, significantly surpassing the others.
  • ...and 19 more figures