ID-Booth: Identity-consistent Face Generation with Diffusion Models

Darian Tomašević, Fadi Boutros, Chenhao Lin, Naser Damer, Vitomir Štruc, Peter Peer

TL;DR

ID-Booth tackles identity consistency in diffusion-based face generation by fine-tuning pretrained diffusion models with a triplet identity objective and a three-component architecture (denoising network, VAE latent space, text encoder). The method uses LoRA to preserve prior synthesis capabilities while enabling identity-specific generation and introduces a total loss that combines reconstruction, prior preservation, and triplet identity terms. It demonstrates improved intra-identity consistency, inter-identity separability, and data diversity, enabling privacy-preserving augmentation that enhances recognition performance on multiple benchmarks. The work offers practical, consent-based synthetic data generation for small-scale datasets and discusses ethical considerations and potential extensions to larger-scale conditioning and demographics coverage.

Abstract

Recent advances in generative modeling have enabled the generation of high-quality synthetic data that is applicable in a variety of domains, including face recognition. Here, state-of-the-art generative models typically rely on conditioning and fine-tuning of powerful pretrained diffusion models to facilitate the synthesis of realistic images of a desired identity. Yet, these models often do not consider the identity of subjects during training, leading to poor consistency between generated and intended identities. In contrast, methods that employ identity-based training objectives tend to overfit on various aspects of the identity, and in turn, lower the diversity of images that can be generated. To address these issues, we present in this paper a novel generative diffusion-based framework, called ID-Booth. ID-Booth consists of a denoising network responsible for data generation, a variational auto-encoder for mapping images to and from a lower-dimensional latent space and a text encoder that allows for prompt-based control over the generation procedure. The framework utilizes a novel triplet identity training objective and enables identity-consistent image generation while retaining the synthesis capabilities of pretrained diffusion models. Experiments with a state-of-the-art latent diffusion model and diverse prompts reveal that our method facilitates better intra-identity consistency and inter-identity separability than competing methods, while achieving higher image diversity. In turn, the produced data allows for effective augmentation of small-scale datasets and training of better-performing recognition models in a privacy-preserving manner. The source code for the ID-Booth framework is publicly available at https://github.com/dariant/ID-Booth.
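
The combination of reconstruction, prior preservation, and triplet identity terms described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that identity embeddings are already available as plain lists of floats (for example, from a face recognition model), that distance is measured as one minus cosine similarity, and that the terms are combined as a weighted sum. The function names, the `margin` value, and the weights `w_pr` and `w_tid` are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def triplet_identity_loss(gen_emb, train_emb, prior_emb, margin=0.5):
    """Hinge-style triplet loss on identity embeddings (illustrative only).

    gen_emb   : embedding of a generated sample (anchor)
    train_emb : embedding of a training sample of the target identity (positive)
    prior_emb : embedding of a prior sample of another identity (negative)
    margin    : hypothetical margin, not a value taken from the paper
    """
    d_pos = 1.0 - cosine_similarity(gen_emb, train_emb)
    d_neg = 1.0 - cosine_similarity(gen_emb, prior_emb)
    # The loss is zero once the generated sample is closer to the target
    # identity than to the prior sample by at least `margin`.
    return max(d_pos - d_neg + margin, 0.0)


def total_loss(l_rec, l_pr, l_tid, w_pr=1.0, w_tid=1.0):
    """Weighted sum of the three terms; the weights are assumptions."""
    return l_rec + w_pr * l_pr + w_tid * l_tid
```

In this sketch, a generated sample whose embedding matches the training identity and is orthogonal to the prior sample incurs no triplet penalty, while the reverse case is penalized by the full distance gap plus the margin.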

Paper Structure

This paper contains 11 sections, 7 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Samples generated with the proposed ID-Booth framework. The framework enables fine-tuning of pretrained diffusion models for generating diverse identity-consistent face images based on images gathered in a constrained setting with the consent of subjects.
  • Figure 2: Overview of the ID-Booth framework. The framework utilizes three training objectives to fine-tune a pretrained diffusion model. $\mathcal{L}_{REC}$ and $\mathcal{L}_{PR}$ are aimed at the reconstruction of training and prior images. In contrast, the proposed triplet identity objective $\mathcal{L}_{TID}$ focuses on the identity similarity between generated samples and both training and prior samples to improve identity consistency without impacting the capabilities of the pretrained model.
  • Figure 3: Comparison of generated image samples. ID-Booth facilitates better identity consistency than DreamBooth (Ruiz et al., 2023) and better image diversity than when utilizing the PortraitBooth (Peng et al., 2024) identity objective, which can limit the variety of facial features and poses.
  • Figure 4: Comparison of identity consistency. ID-Booth achieves better identity consistency than DreamBooth (Ruiz et al., 2023), while retaining more diverse synthesis capabilities and ensuring better intra-identity diversity than PortraitBooth (Peng et al., 2024). Reported is the cosine similarity of synthetic and real identity features extracted with the pretrained ArcFace recognition model (Deng et al., 2019).
  • Figure 5: Ablation study of ID-Booth training objectives. Shown are sample images generated by the ID-Booth framework trained with different training objectives. Training only with $\mathcal{L}_{REC}$ generates images similar to the training set, disregarding the given prompts. $\mathcal{L}_{PR}$ improves the diversity of samples but lowers identity consistency. Our proposed $\mathcal{L}_{TID}$ presents an objective that improves both diversity and identity consistency.
  • ...and 1 more figure
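
The identity consistency measure mentioned in the Figure 4 caption, the cosine similarity between synthetic and real identity features, can be sketched as follows. This is a hedged illustration and not the paper's evaluation code. It assumes the features have already been extracted (for instance with a pretrained recognition model) and are given as lists of floats. Averaging over all synthetic-real pairs is also an assumption, and `mean_identity_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def mean_identity_similarity(synthetic_embs, real_embs):
    """Average cosine similarity over all synthetic-real embedding pairs.

    Both arguments are non-empty lists of feature vectors belonging to the
    same identity; higher values suggest better identity consistency.
    """
    scores = [
        cosine_similarity(s, r)
        for s in synthetic_embs
        for r in real_embs
    ]
    return sum(scores) / len(scores)
```

Comparing this average across generation methods for the same identity gives a rough view of which method keeps synthetic faces closer to the real subject.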