Arc2Face: A Foundation Model for ID-Consistent Human Faces

Foivos Paraperas Papantoniou, Alexandros Lattas, Stylianos Moschoglou, Jiankang Deng, Bernhard Kainz, Stefanos Zafeiriou

TL;DR

The paper tackles the challenge of generating high-fidelity, ID-consistent human faces conditioned on identity embeddings. It introduces Arc2Face, a diffusion-based foundation model that maps ArcFace embeddings into the CLIP latent space via a fine-tuned encoder, preserving identity without text prompts. Experiments show better identity retention, diversity, and realism than CLIP- or FR-based baselines, and the model produces high-resolution outputs (512x512) from a single ID vector. The work enables scalable ID-preserving face synthesis and synthetic-data generation for face recognition and downstream tasks, while acknowledging ethical considerations and the need for detection mechanisms.

Abstract

This paper presents Arc2Face, an identity-conditioned face foundation model, which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a higher degree of face similarity than existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g. FFHQ) lack sufficient identities to reconstruct any subject. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR). Arc2Face builds upon a pretrained Stable Diffusion model, yet adapts it to the task of ID-to-face generation, conditioned solely on ID vectors. Deviating from recent works that combine ID with text embeddings for zero-shot personalization of text-to-image models, we emphasize the compactness of FR features, which can fully capture the essence of the human face, as opposed to hand-crafted prompts. Crucially, text-augmented models struggle to decouple identity and text, usually necessitating some description of the given face to achieve satisfactory similarity. Arc2Face, however, only needs the discriminative features of ArcFace to guide the generation, offering a robust prior for a plethora of tasks where ID consistency is of paramount importance. As an example, we train an FR model on synthetic images from our model and achieve superior performance to existing synthetic datasets.
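
The idea of conditioning on an ID vector in place of a text prompt can be illustrated with a toy sketch. This is not the authors' code: the sequence length (77), token width (768), placeholder position, zero-padding, and the helper name `inject_id` are all assumptions made for illustration, and random arrays stand in for real token and ArcFace embeddings. It only shows one way a 512-dimensional ID vector might occupy a single token slot of a fixed pseudo-prompt before the sequence is passed to a text encoder.

```python
import numpy as np

ID_DIM = 512          # typical ArcFace embedding size
TOKEN_DIM = 768       # assumed token width of the text encoder
SEQ_LEN = 77          # assumed prompt length in tokens
PLACEHOLDER_POS = 4   # hypothetical index of the placeholder token


def inject_id(prompt_tokens: np.ndarray, id_vec: np.ndarray,
              pos: int = PLACEHOLDER_POS) -> np.ndarray:
    """Return a copy of prompt_tokens whose row `pos` holds the padded ID vector."""
    if id_vec.shape != (ID_DIM,):
        raise ValueError(f"expected an ID vector of shape ({ID_DIM},)")
    if prompt_tokens.shape[1] < ID_DIM:
        raise ValueError("token width must be at least the ID dimension")
    # Normalize the ID vector to unit length, then zero-pad it to the token width.
    unit = id_vec / np.linalg.norm(id_vec)
    padded = np.zeros(prompt_tokens.shape[1], dtype=prompt_tokens.dtype)
    padded[:ID_DIM] = unit
    # Leave every other token of the pseudo-prompt untouched.
    out = prompt_tokens.copy()
    out[pos] = padded
    return out


rng = np.random.default_rng(0)
# Random stand-ins for the pseudo-prompt token embeddings and an ArcFace vector.
prompt_tokens = rng.normal(size=(SEQ_LEN, TOKEN_DIM)).astype(np.float32)
id_vec = rng.normal(size=ID_DIM).astype(np.float32)

conditioned = inject_id(prompt_tokens, id_vec)
print(conditioned.shape)
```

In this sketch the rest of the prompt stays fixed, so the only information that varies between subjects is the ID vector itself, which mirrors the abstract's point that generation is conditioned solely on ID features.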

Paper Structure

This paper contains 27 sections, 1 equation, 20 figures, and 2 tables.

Figures (20)

  • Figure 1: Given the ID-embedding from ArcFace [deng2019arcface], Arc2Face can generate high-quality images of any subject with compelling similarity. Using popular extensions, such as ControlNet [zhang2023adding], we can explicitly control facial attributes such as the pose or expression.
  • Figure 2: Overview of Arc2Face. We use a straightforward design to condition Stable Diffusion on ID features. The ArcFace embedding is processed by the text encoder using a frozen pseudo-prompt for compatibility, allowing projection into the CLIP latent space for cross-attention control. Both the encoder and UNet are optimized on a million-scale FR dataset [zhu2021webface260m] (after upsampling), followed by additional fine-tuning on high-quality datasets [karras2017progressive, karras2019style], without any text annotations. The resulting model exclusively adheres to ID-embeddings, disregarding its initial language guidance.
  • Figure 3: ArcFace [deng2019arcface] similarity distributions between input and generated faces from LD models trained on the ID-to-image task. We use two different datasets of input IDs for evaluation (500 and 400 IDs respectively) and generate 5 images per ID. We compare models trained on three datasets: FFHQ, WebFace42M-10%, and WebFace42M.
  • Figure 4: Distribution of ArcFace similarity between input IDs, synthetic (a) or real (b), and generated images of them by different models. As all non-CLIP-based methods use ArcFace [deng2019arcface] for conditioning, we evaluate them with ArcFace [deng2019arcface]. For an evaluation with a different network, please refer to the Supp. Material, where similar observations can be made.
  • Figure 5: Percentage of user votes received by our method and InstantID regarding ID fidelity.
  • ...and 15 more figures
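
The similarity distributions described in the captions of Figures 3 and 4 can be approximated with a few lines of NumPy. The sketch below is illustrative only: the embeddings are random stand-ins rather than real ArcFace outputs, the noise level and the helper name `cosine_similarities` are assumptions, and only the counts (500 input IDs, 5 generated images per ID) are taken from the Figure 3 caption. It computes the cosine similarity between each input ID embedding and the embeddings of its generated images, then bins the values into a histogram.

```python
import numpy as np


def cosine_similarities(input_ids: np.ndarray, generated: np.ndarray) -> np.ndarray:
    """Cosine similarity between each input ID and its generated images.

    input_ids: (N, D) array, one embedding per identity.
    generated: (N, K, D) array, K generated-image embeddings per identity.
    Returns an (N, K) array of similarities.
    """
    a = input_ids / np.linalg.norm(input_ids, axis=-1, keepdims=True)
    b = generated / np.linalg.norm(generated, axis=-1, keepdims=True)
    return np.einsum("nd,nkd->nk", a, b)


rng = np.random.default_rng(0)
num_ids, per_id, dim = 500, 5, 512

# Stand-in for ID embeddings extracted from the input faces.
input_ids = rng.normal(size=(num_ids, dim))
# Stand-in for embeddings of generated faces: the input ID plus noise (assumed level).
generated = input_ids[:, None, :] + 0.5 * rng.normal(size=(num_ids, per_id, dim))

sims = cosine_similarities(input_ids, generated)

# Histogram over all (ID, image) pairs, i.e. the kind of distribution plotted per model.
hist, edges = np.histogram(sims.ravel(), bins=20, range=(-1.0, 1.0))
print(f"mean similarity: {sims.mean():.3f}")
```

With real data, `input_ids` and `generated` would come from a face recognition network applied to the input and generated images, and one histogram would be produced per model being compared.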