Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

Renshuai Liu, Bowen Ma, Wei Zhang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Xuan Cheng

TL;DR

The paper tackles the problem of personalized face generation by enabling simultaneous identity and fine-grained expression control within a specified background. It introduces a diffusion-based framework, DiffSFSR, that ingests a background prompt, a user selfie, and a fine-grained expression label (EmoFace135) to perform simultaneous face swapping and reenactment with explicit background conditioning. Key contributions include a balanced identity/expression encoding scheme with compound embeddings, an explicit background conditioning mechanism, and an improved midpoint sampling method, all implemented in a latent-diffusion backbone. Experimental results show high identity and expression fidelity, superior performance relative to text-to-image and existing face manipulation baselines, and strong user-study support, indicating practical potential for controllable, personalized portrait generation. The work advances multi-modal conditioning for diffusion models and broadens the expressive capability of identity-preserving face synthesis.

Abstract

In human-centric content generation, pre-trained text-to-image models struggle to produce the portrait images users want: images that retain an individual's identity while exhibiting diverse expressions. This paper introduces our efforts towards personalized face generation. To this end, we propose a novel multi-modal face generation framework capable of simultaneous identity-expression control and finer-grained expression synthesis. The expression control is precise enough to be specified with a fine-grained emotional vocabulary. We devise a novel diffusion model that performs face swapping and reenactment simultaneously. Because identity and expression are entangled, controlling them separately and precisely within one framework is nontrivial and has not yet been explored. To overcome this, we propose several designs in the conditional diffusion model, including balanced identity and expression encoding, improved midpoint sampling, and explicit background conditioning. Extensive experiments demonstrate the controllability and scalability of the proposed framework in comparison with state-of-the-art text-to-image, face swapping, and face reenactment methods.

Paper Structure (13 sections, 9 equations, 47 figures, 4 tables)

Figures (47)

  • Figure 1: The proposed framework takes three inputs: a prompt describing the background, a selfie photo uploaded by the user, and a text giving the fine-grained expression label. The generated faces closely match the input triplets and exhibit fine-grained expression synthesis.
  • Figure 2: Overview of the proposed face generation framework.
  • Figure 3: Pipeline of DiffSFSR, including the training and inference phases. Although the diffusion model is trained and tested in the latent space of Stable Diffusion, we illustrate all processes in the original image space for visualization. The transformations between the image space $\mathbf{x}$ and the latent space $\mathbf{z}$ are omitted for brevity.
  • Figure 4: The network architecture of the denoising UNet. QKV denotes the cross-attention layer.
  • Figure 5: A subset of expression synthesis samples from the 135 expression classes. Please zoom in for more details.
  • ...and 42 more figures