Face2Diffusion for Fast and Editable Face Personalization

Kaede Shiohara, Toshihiko Yamasaki

TL;DR

Face2Diffusion (F2D) is a high-editability face personalization method that greatly improves the trade-off between identity fidelity and text fidelity compared to previous state-of-the-art methods.

Abstract

Face personalization aims to insert specific faces, taken from images, into pretrained text-to-image diffusion models. However, previous methods still struggle to preserve both identity similarity and editability because they overfit to training samples. In this paper, we propose Face2Diffusion (F2D) for high-editability face personalization. The core idea behind F2D is that removing identity-irrelevant information from the training pipeline prevents overfitting and improves the editability of encoded faces. F2D consists of the following three novel components: 1) A multi-scale identity encoder provides well-disentangled identity features while keeping the benefits of multi-scale information, which improves the diversity of camera poses. 2) Expression guidance disentangles face expressions from identities and improves the controllability of face expressions. 3) Class-guided denoising regularization encourages models to learn how faces should be denoised, which boosts the text-alignment of backgrounds. Extensive experiments on the FaceForensics++ dataset and diverse prompts demonstrate that our method greatly improves the trade-off between identity fidelity and text fidelity compared to previous state-of-the-art methods.

Paper Structure (24 sections, 13 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: Our Results. Face2Diffusion satisfies challenging text prompts that include multiple conditions while preserving input face identities without individual test-time tuning.
  • Figure 2: Typical overfitting to input data. The original StableDiffusion [latentdiffusion] can generate text-aligned images with plausible backgrounds, camera poses, and diverse face expressions. Nevertheless, a previous method [customdiffusion] fails to disentangle this identity-irrelevant information from the input sample.
  • Figure 3: Overview of Face2Diffusion. (a) During training, we input a face image into our novel multi-scale identity encoder $f_{id}$ and an off-the-shelf 3D face reconstruction model $f_{exp}$ to extract identity and expression features, respectively. The concatenated feature is projected into the text space as a word embedding $S^*$ by a mapping network $f_{map}$. The input image is also encoded by the VAE's encoder $\mathcal{E}$, and Gaussian noise $\epsilon$ is then added to the resulting latent. We constrain the denoised latent feature map to match the original one in the foreground and a class-guided denoised result in the background. (b) During inference, the expression feature is replaced with an unconditional vector $\tilde{v}_{exp}$ to diversify the face expressions of generated images. After injecting the face embedding $S^*$ into an input text, the original denoising loop of StableDiffusion is performed to generate an image conditioned on the input face identity and text.
  • Figure 4: 3rd layer of ArcFace
  • Figure 5: 12th layer of ArcFace
  • ...and 10 more figures
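
The data flow described in the Figure 3 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the single linear layer standing in for $f_{map}$, the mean-squared-error form of the loss, the function names, and all array shapes are hypothetical. Only the concatenate-then-project step, the replacement of the expression feature with $\tilde{v}_{exp}$ at inference, and the foreground/background split of the denoising target are taken from the caption.

```python
import numpy as np


def build_face_embedding(v_id, v_exp, w_map, b_map):
    """Concatenate identity and expression features and project them to S*.

    Assumption: a single linear layer stands in for the mapping network f_map.
    """
    features = np.concatenate([v_id, v_exp], axis=-1)
    return features @ w_map + b_map


def compose_denoising_target(z0, z_class, fg_mask):
    """Build the target latent described in the caption.

    Foreground (mask == 1) keeps the original latent z0; background
    (mask == 0) uses the class-guided denoised latent z_class.
    """
    return fg_mask * z0 + (1.0 - fg_mask) * z_class


def masked_denoising_loss(z_pred, z0, z_class, fg_mask):
    """Mean squared error against the composed target (loss form is assumed)."""
    target = compose_denoising_target(z0, z_class, fg_mask)
    return float(np.mean((z_pred - target) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Hypothetical feature sizes, chosen only for the demonstration.
    v_id = rng.normal(size=8)        # identity feature from f_id
    v_exp = rng.normal(size=4)       # expression feature from f_exp
    v_exp_uncond = np.zeros(4)       # placeholder for the unconditional vector
    w_map = rng.normal(size=(12, 16))
    b_map = np.zeros(16)

    s_train = build_face_embedding(v_id, v_exp, w_map, b_map)
    s_infer = build_face_embedding(v_id, v_exp_uncond, w_map, b_map)
    print(s_train.shape, s_infer.shape)

    z0 = rng.normal(size=(4, 8, 8))       # original latent
    z_class = rng.normal(size=(4, 8, 8))  # class-guided denoised latent
    fg_mask = np.zeros((1, 8, 8))
    fg_mask[:, 2:6, 2:6] = 1.0            # toy face region
    z_pred = compose_denoising_target(z0, z_class, fg_mask)
    print(masked_denoising_loss(z_pred, z0, z_class, fg_mask))
```

In this sketch the same projection is used at training and inference time; only the expression input changes, which mirrors part (b) of the caption. The zero vector used for the unconditional expression feature is a placeholder, since the caption does not say how $\tilde{v}_{exp}$ is obtained.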