EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

Liangwei Jiang; Ruida Li; Zhifeng Zhang; Shuo Fang; Chenguang Ma

TL;DR

EmojiDiff is an end-to-end framework for simultaneous fine-grained (RGB-level) expression control and high-fidelity identity preservation in portrait generation. Its two-stage training pipeline first uses ID-irrelevant Data Iteration (IDI) to synthesize cross-identity expression data, then applies ID-enhanced Contrast Alignment (ICA) for efficient fine-tuning, aided by the Adaptive Noise Inversion (ANI) technique. A pluggable E-Adapter with decoupled identity and expression branches enables robust control, while the new CIEP100k dataset supports research on expression–identity disentanglement. Empirical results show better expression fidelity and identity preservation than prior methods across multiple diffusion backbones and styles, indicating practical applicability to expressive portrait synthesis and related tasks.

Abstract

This paper aims to bring fine-grained expression control to portrait generation while maintaining high-fidelity identity. This is challenging due to the mutual interference between expression and identity: (i) fine expression control signals inevitably introduce appearance-related semantics (e.g., facial contours and proportions), which affect the identity of the generated portrait; (ii) even coarse-grained expression control can cause facial changes that compromise identity, since both act on the face. These limitations remain unaddressed by previous generation methods, which primarily rely on coarse control signals or on two-stage inference that integrates portrait animation. Here, we introduce EmojiDiff, the first end-to-end solution that enables simultaneous control of extremely detailed expression (RGB-level) and high-fidelity identity in portrait generation. To address the above challenges, EmojiDiff adopts a two-stage scheme involving decoupled training and fine-tuning. For decoupled training, we propose ID-irrelevant Data Iteration (IDI), which synthesizes cross-identity expression pairs by separating and individually optimizing the processes of maintaining expression and altering identity, thereby ensuring stable, high-quality data generation. Training the model with this data effectively disentangles the fine expression features in the expression template from extraneous information (e.g., identity, skin). Subsequently, we present ID-enhanced Contrast Alignment (ICA) for further fine-tuning. ICA achieves rapid reconstruction and joint supervision of identity and expression information, thus aligning the identity representations of images generated with and without expression control. Experimental results demonstrate that our method substantially outperforms competing methods, achieves precise expression control while closely preserving identity, and generalizes well to various diffusion models.
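The abstract does not specify the ICA objective. As a rough illustration only of what "aligning identity representations of images with and without expression control" could look like, the sketch below penalizes the cosine distance between two identity embeddings. The function names, the use of cosine distance, and the assumption that the embeddings come from some face-recognition encoder are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_alignment_loss(id_with_expr, id_without_expr):
    """Hypothetical alignment term: 1 - cosine similarity.

    id_with_expr:    identity embedding of the image generated WITH expression control
    id_without_expr: identity embedding of the image generated WITHOUT it
    The loss is 0 when the two embeddings point in the same direction
    (assumed stand-in for the paper's objective, not its actual formulation).
    """
    return 1.0 - cosine_similarity(id_with_expr, id_without_expr)
```

Under these assumptions, a small loss means that adding expression control did not shift the identity embedding, which is the behavior the fine-tuning stage is described as encouraging.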
