EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

Liangwei Jiang, Ruida Li, Zhifeng Zhang, Shuo Fang, Chenguang Ma

TL;DR

EmojiDiff presents an end-to-end framework for simultaneous, fine-grained RGB-level expression control and high-fidelity identity preservation in portrait generation. It introduces a two-stage training pipeline: ID-irrelevant Data Iteration (IDI) to synthesize cross-identity expression data and ID-enhanced Contrast Alignment (ICA) for efficient fine-tuning, aided by the Adaptive Noise Inversion (ANI) technique. A pluggable E-Adapter with decoupled identity and expression branches enables robust control, while the new CIEP100k dataset facilitates expression–identity disentanglement research. Empirical results show superior expression fidelity and identity preservation across multiple diffusion backbones and styles, underscoring practical applicability in expressive portrait synthesis and related tasks.

Abstract

This paper aims to bring fine-grained expression control to portrait generation while maintaining high-fidelity identity. This is challenging because expression and identity interfere with each other: (i) fine expression control signals inevitably introduce appearance-related semantics (e.g., facial contours and proportions), which affect the identity of the generated portrait; (ii) even coarse-grained expression control can cause facial changes that compromise identity, since both act on the face. These limitations remain unaddressed by previous generation methods, which primarily rely on coarse control signals or on two-stage inference that integrates portrait animation. Here, we introduce EmojiDiff, the first end-to-end solution that enables simultaneous control of extremely detailed expression (RGB-level) and high-fidelity identity in portrait generation. To address the above challenges, EmojiDiff adopts a two-stage scheme involving decoupled training and fine-tuning. For decoupled training, we propose ID-irrelevant Data Iteration (IDI), which synthesizes cross-identity expression pairs by separating, and individually optimizing, the processes of maintaining expression and altering identity, thereby ensuring stable and high-quality data generation. Training the model on this data effectively disentangles the fine expression features in the expression template from extraneous information (e.g., identity, skin). Subsequently, we present ID-enhanced Contrast Alignment (ICA) for further fine-tuning. ICA achieves rapid reconstruction and joint supervision of identity and expression information, thus aligning the identity representations of images generated with and without expression control. Experimental results demonstrate that our method substantially outperforms competing methods, achieves precise expression control while preserving identity, and generalizes well to various diffusion models.
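
The abstract describes ICA as jointly supervising identity and expression so that identity representations agree with and without expression control. The paper's actual loss is not given in this section, so the following is only a minimal sketch of that idea under stated assumptions: identity is compared by cosine distance between two embedding vectors, expression by mean squared error, and the two terms are combined with weights. The function names, the choice of distances, and the weights `id_weight` and `expr_weight` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def identity_alignment_loss(id_with_expr, id_without_expr):
    """Assumed identity term: cosine distance between the identity embedding
    of an image generated WITH expression control and one generated WITHOUT.
    It is 0 when the two embeddings point in the same direction."""
    return 1.0 - cosine_similarity(id_with_expr, id_without_expr)


def expression_loss(expr_generated, expr_reference):
    """Assumed expression term: mean squared error between expression
    features of the generated image and of the reference expression."""
    diffs = [(g - r) ** 2 for g, r in zip(expr_generated, expr_reference)]
    return sum(diffs) / len(diffs)


def joint_loss(id_with_expr, id_without_expr, expr_generated, expr_reference,
               id_weight=1.0, expr_weight=1.0):
    """Weighted sum of the two assumed terms (the weights are hypothetical)."""
    return (id_weight * identity_alignment_loss(id_with_expr, id_without_expr)
            + expr_weight * expression_loss(expr_generated, expr_reference))
```

In this sketch the identity term pulls the two identity embeddings together regardless of the expression applied, while the expression term keeps the generated expression close to the reference; how the paper balances or formulates these terms may differ.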

Paper Structure

This paper contains 27 sections, 17 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Expression customization by different methods. (a) The methods extract the control signals from expressions and inject them into diffusion models. (b) The methods generate stylized images and manipulate them through animation. (c) The proposed method simultaneously incorporates portrait images and reference expressions, generating stylistic images in an end-to-end manner.
  • Figure 1: Different expression controller structure.
  • Figure 2: Overview of the proposed method. To integrate RGB-driven expression control into diffusion models, we aim to synthesize cross-identity data $\{\mathbf{S}_i^{\neg{e}}, \mathbf{R}_{\neg{i}}^e, \mathbf{T}_i^e\}$ for the model's decoupled training, and mitigate the negative impact on the original identity through contrastive alignment fine-tuning. Before decoupled training, the fundamental expression controller (i.e., Base E-Adapter) is trained with same-identity data $\{\mathbf{S}_i^{\neg{e}}, \mathbf{R}_{i}^e, \mathbf{T}_i^e\}$ to obtain expression transfer capabilities (the structure of the E-Adapter is illustrated in Fig. 3). Next, the trained Base E-Adapter and FaceFusion are utilized to alter the identity of portraits, thereby creating cross-identity expression pairs $\{\mathbf{R}_{\neg{i}}^e, \mathbf{T}_i^e\}$. Subsequently, the Refined E-Adapter uses newly synthesized data for disentangled training, facilitating dual control of identity and expression without ID leakage. Finally, the Refined E-Adapter is fine-tuned by expression and identity loss based on ANI.
  • Figure 2: Images generated using different prompts.
  • Figure 3: The proposed E-Adapter. The embeddings of identity and expression images are obtained through respective branches. Subsequently, the ID embedding, text embedding, and expression embeddings are integrated into the network, as depicted in Eq. (\ref{eq:eq6}).
  • ...and 6 more figures
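
The Figure 3 caption says that the ID, text, and expression embeddings are integrated into the network, but the equation it cites is not reproduced here. One common way adapters combine several conditions is decoupled cross-attention, where each condition gets its own attention pass and the results are summed. The sketch below illustrates that pattern only as an assumption about what such an integration could look like; it is not the paper's equation. The function names and the scales `id_scale` and `expr_scale` are hypothetical, and learned key/value projections are omitted (each token serves as both key and value).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, tokens):
    """Scaled dot-product attention where each token acts as key and value."""
    dim = len(tokens[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, tok)) / math.sqrt(dim)
                  for tok in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * tok[j] for w, tok in zip(weights, tokens))
                        for j in range(dim)])
    return outputs


def combine_branches(queries, text_tokens, id_tokens, expr_tokens,
                     id_scale=1.0, expr_scale=1.0):
    """Sum of three independent attention passes (assumed combination rule):
    text + id_scale * identity + expr_scale * expression."""
    text_out = attend(queries, text_tokens)
    id_out = attend(queries, id_tokens)
    expr_out = attend(queries, expr_tokens)
    return [[t + id_scale * i + expr_scale * e
             for t, i, e in zip(t_row, i_row, e_row)]
            for t_row, i_row, e_row in zip(text_out, id_out, expr_out)]


# Usage: one query and one token per branch, so each pass returns its token.
result = combine_branches([[1.0, 0.0]], [[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]])
```

Keeping the branches separate means that setting `expr_scale` to zero in this sketch removes the expression contribution without touching the text or identity paths, which is the kind of pluggable behavior the TL;DR attributes to the E-Adapter.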