ExpPortrait: Expressive Portrait Generation via Personalized Representation

Junyi Wang, Yudong Guo, Boyang Guo, Shengming Yang, Juyong Zhang

TL;DR

The paper proposes a high-fidelity personalized head representation that disentangles expression and identity more effectively than 2D landmarks or parametric models. Used as the conditioning signal for a DiT-based video generator, it outperforms previous methods in identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.

Abstract

While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized details due to their sparse or low-rank representation. Therefore, existing methods based on these models struggle to accurately preserve subject identity and expressions, hindering the generation of highly expressive portrait videos. To overcome these limitations, we propose a high-fidelity personalized head representation that more effectively disentangles expression and identity. This representation captures both static, subject-specific global geometry and dynamic, expression-related details. Furthermore, we introduce an expression transfer module to achieve personalized transfer of head pose and expression details between different identities. We use this sophisticated and highly expressive head model as a conditional signal to train a diffusion transformer (DiT)-based generator to synthesize richly detailed portrait videos. Extensive experiments on self- and cross-reenactment tasks demonstrate that our method outperforms previous models in terms of identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.

Paper Structure (20 sections, 15 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: ExpPortrait uses a personalized head representation for portrait animation, generating videos with high consistency and high fidelity. This stands in contrast to methods such as Follow-Your-Emoji [ma2024follow], which are constrained by low-rank, smooth intermediate representations.
  • Figure 2: Our framework. To address the limited decoupling capability and insufficient expressiveness of current parametric head representations, we propose a personalized head representation. Starting from the SMPL-X base mesh, we jointly optimize two complementary offset fields, one static and one dynamic. We then construct an identity-adaptive expression transfer module to achieve cross-identity expression transfer. Using our head representation as a control signal, we guide a diffusion model to generate highly consistent and expressive portrait videos.
  • Figure 3: An illustration of our optimization pipeline for transforming a generic SMPL-X mesh into our highly detailed, personalized head representation.
  • Figure 4: Design of our Expression Transfer Module.
  • Figure 5: Qualitative results in self-reenactment. Compared with other methods, ours preserves more identity and facial-expression detail.
  • ...and 3 more figures
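
The Figure 2 caption describes a base mesh refined by a static offset field (subject-specific geometry) and a dynamic offset field (expression-related detail). A minimal sketch of how such a composition could look is given below. It is an illustration under stated assumptions, not the authors' implementation: the names `personalize_mesh` and `linear_dynamic_field` are hypothetical, the additive per-vertex form is assumed, and the toy linear dynamic field only stands in for whatever learned field the paper actually optimizes.

```python
from typing import Callable, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
OffsetField = Callable[[Sequence[float]], List[Vec3]]


def linear_dynamic_field(basis: Sequence[Sequence[Vec3]]) -> OffsetField:
    """Toy dynamic field (assumption): a linear blend of per-vertex displacements.

    basis[k][i] is the displacement of vertex i for expression coefficient k.
    """
    def field(expression: Sequence[float]) -> List[Vec3]:
        offsets: List[Vec3] = [(0.0, 0.0, 0.0)] * len(basis[0])
        for coeff, component in zip(expression, basis):
            offsets = [
                (o[0] + coeff * c[0], o[1] + coeff * c[1], o[2] + coeff * c[2])
                for o, c in zip(offsets, component)
            ]
        return offsets
    return field


def personalize_mesh(
    base_vertices: Sequence[Vec3],
    static_offsets: Sequence[Vec3],
    dynamic_field: OffsetField,
    expression: Sequence[float],
) -> List[Vec3]:
    """Assumed composition: base vertex + static offset + dynamic offset(expression)."""
    dynamic_offsets = dynamic_field(expression)
    if not (len(base_vertices) == len(static_offsets) == len(dynamic_offsets)):
        raise ValueError("all per-vertex arrays must have the same length")
    return [
        (b[0] + s[0] + d[0], b[1] + s[1] + d[1], b[2] + s[2] + d[2])
        for b, s, d in zip(base_vertices, static_offsets, dynamic_offsets)
    ]


# Two-vertex toy mesh: the static offset is fixed per subject,
# while the dynamic offset changes with the expression coefficients.
base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
static = [(0.0, 0.25, 0.0), (0.0, 0.0, 0.0)]
field = linear_dynamic_field([[(0.0, 0.0, 0.5), (0.0, 0.0, 0.0)]])

neutral = personalize_mesh(base, static, field, [0.0])
expressive = personalize_mesh(base, static, field, [1.0])
```

In this sketch the static offsets are identical across both calls, so only the dynamic term separates `neutral` from `expressive`. That separation is the kind of identity/expression split the caption refers to, though the paper's actual parameterization may differ.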