Face0: Instantaneously Conditioning a Text-to-Image Model on a Face

Dani Valevski, Danny Wasserman, Yossi Matias, Yaniv Leviathan

TL;DR

Face0 conditions a diffusion-based text-to-image model on a face instantly, by training a lightweight face-embedding projection that conditions Stable Diffusion alongside the text prompt. The method pairs a face-embedding module with a small MLP that maps the embedding into CLIP space, and trains this projection jointly with the diffusion network, so inference takes seconds and needs no per-subject fine-tuning. It offers fine-grained control through combined text and face conditioning, supports consistent character generation, and explores possible bias mitigation by decoupling facial features from textual cues. The authors acknowledge risks around biases and identity preservation, and suggest future directions such as multi-face conditioning and application to broader domains.

Abstract

We present Face0, a novel way to instantaneously condition a text-to-image generation model on a face, at sample time, without any optimization procedures such as fine-tuning or inversions. We augment a dataset of annotated images with embeddings of the included faces and train an image generation model on the augmented dataset. Once trained, our system is practically identical at inference time to the underlying base model, and is therefore able to generate images, given a user-supplied face image and a prompt, in just a couple of seconds. Our method achieves pleasing results, is remarkably simple, extremely fast, and equips the underlying model with new capabilities, like controlling the generated images either via text or via direct manipulation of the input face embeddings. In addition, when using a fixed random vector instead of a face embedding from a user-supplied image, our method essentially solves the problem of consistent character generation across images. Finally, while requiring further research, we hope that our method, which decouples the model's textual biases from its biases on faces, might be a step towards some mitigation of biases in future text-to-image models.
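
The conditioning idea described above can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the face-embedding size (512), the token width (768), the prompt length (77), the number of face tokens (5), the hidden width of the MLP, and the choice to overwrite the last token positions are all hypothetical values picked for the example, and the weights are random rather than trained.

```python
import numpy as np

FACE_DIM = 512       # assumed size of the face embedding
CLIP_DIM = 768       # assumed width of one text-encoder token
SEQ_LEN = 77         # assumed prompt length in tokens
N_FACE_TOKENS = 5    # assumed number of token slots given to the face
HIDDEN_DIM = 1024    # assumed hidden width of the projection MLP

rng = np.random.default_rng(0)

# Hypothetical two-layer MLP; in a real system these weights would be learned
# jointly with the diffusion model.
W1 = rng.normal(0.0, 0.02, (FACE_DIM, HIDDEN_DIM))
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(0.0, 0.02, (HIDDEN_DIM, N_FACE_TOKENS * CLIP_DIM))
b2 = np.zeros(N_FACE_TOKENS * CLIP_DIM)


def project_face(face_emb):
    """Map one face embedding to N_FACE_TOKENS vectors in the token space."""
    hidden = np.maximum(face_emb @ W1 + b1, 0.0)  # ReLU
    return (hidden @ W2 + b2).reshape(N_FACE_TOKENS, CLIP_DIM)


def condition(text_tokens, face_emb):
    """Return a copy of the encoded prompt whose last slots carry the face."""
    out = text_tokens.copy()
    out[-N_FACE_TOKENS:] = project_face(face_emb)
    return out


# Stand-ins for an encoded prompt and for an embedding extracted from a face image.
text_tokens = rng.normal(size=(SEQ_LEN, CLIP_DIM))
face_emb = rng.normal(size=FACE_DIM)
cond = condition(text_tokens, face_emb)
```

Under the same assumptions, drawing `face_emb` once from a random generator and reusing it across different prompts corresponds to the fixed random vector that the abstract describes for consistent character generation.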

Paper Structure (16 sections, 3 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: The training scheme for Face0 (see \ref{sec:method}). Everything except the dashed red arrows is part of the standard diffusion model training procedure. For simplicity, we omit the details of converting from pixel space to latent space.
  • Figure 2: Samples for the prompts "A stock photo of X" for X in $\{"doctor", "CEO", "programmer"\}$ from the base model (left) and our model with a random face embedding (right).
  • Figure 3: Our model allows overriding features from the face embedding via the textual prompt.
  • Figure 4: Face0 enables control of facial features that are harder to describe textually via direct manipulation of the face embedding. Here we see simple linear interpolation between the left and right faces.
  • Figure 5: Face0 enables fine-grained control of facial features that are harder to describe textually via direct manipulation of the face embedding. Here we see a simple linear interpolation between the facial embeddings of two generated photos from the same source (the top-left image in \ref{fig:prompt_control}) with different textual prompts.
  • ...and 4 more figures
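
The linear interpolation mentioned in the captions of Figures 4 and 5 can be illustrated as below. This is a hedged sketch rather than the paper's code: the embedding size (512), the number of steps, and the helper name `interpolate_embeddings` are assumptions, and the random vectors only stand in for embeddings of two real faces.

```python
import numpy as np


def interpolate_embeddings(left, right, steps):
    """Return `steps` embeddings evenly spaced on the line from left to right."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - a) * left + a * right for a in alphas])


rng = np.random.default_rng(0)
left = rng.normal(size=512)    # stand-in for the embedding of the left face
right = rng.normal(size=512)   # stand-in for the embedding of the right face

# Each row would be passed to the model as the face condition for one image.
path = interpolate_embeddings(left, right, steps=5)
```

Each row of `path` is a blend of the two endpoints, so generating one image per row with the same prompt and seed would give a sequence that moves from the left face to the right one.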