Digi2Real: Bridging the Realism Gap in Synthetic Data Face Recognition via Foundation Models

Anjith George, Sebastien Marcel

TL;DR

The paper tackles the realism gap in synthetic face data for face recognition by introducing Digi2Real, a realism-transfer framework that reuses DigiFace identities and enhances realism through Arc2Face-based generation, CLIP-space alignment with a learned offset $\Delta$, and SLERP-driven intra-class variation. This hybrid approach combines a graphics-rendering pipeline with foundation-model-backed realism to produce Digi2Real-20K, a synthetic dataset that yields substantial improvements over DigiFace and competitive results relative to state-of-the-art synthetic datasets, especially on the IJB-B and IJB-C benchmarks. A key finding is that modest real-data augmentation (e.g., starting from 1,000 identities) can further close the gap between synthetic- and real-data training, suggesting a practical path toward privacy-preserving FR systems. The work demonstrates the potential of realism transfer to enable large-scale, controllable synthetic data with strong downstream performance, and it provides public code and data resources to foster further research.

Abstract

The accuracy of face recognition systems has improved significantly in the past few years, thanks to the large amount of data collected and advancements in neural network architectures. However, these large-scale datasets are often collected without explicit consent, raising ethical and privacy concerns. To address this, there have been proposals to use synthetic datasets for training face recognition models. Yet, such models still rely on real data to train the generative models and generally exhibit inferior performance compared to those trained on real datasets. One of these datasets, DigiFace, uses a graphics pipeline to generate different identities and intra-class variations without using real data in model training. However, the performance of this approach is poor on face recognition benchmarks, possibly due to the lack of realism in the images generated by the graphics pipeline. In this work, we introduce a novel framework for realism transfer aimed at enhancing the realism of synthetically generated face images. Our method leverages the large-scale face foundation model, and we adapt the pipeline for realism enhancement. By integrating the controllable aspects of the graphics pipeline with our realism enhancement technique, we generate a large amount of realistic variations, combining the advantages of both approaches. Our empirical evaluations demonstrate that models trained using our enhanced dataset significantly improve the performance of face recognition systems over the baseline. The source code and dataset will be publicly accessible at the following link: https://www.idiap.ch/paper/digi2real

Paper Structure

This paper contains 8 sections, 5 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The images on the left show an example identity from the DigiFace dataset alongside its realism-enhanced versions, illustrating intra-class variations. On the right, the first row showcases original images from the DigiFace dataset, while the second row presents the corresponding transformed images generated using our approach.
  • Figure 2: Different stages of the proposed generation pipeline: We start with the original images from DigiFace and generate a class prototype and intra-class variations. A pre-trained Arc2Face model generates the identity-conditioned images, and we add the CLIP shift to enhance realism.
  • Figure 3: T-SNE plots of samples from the FFHQ and DigiFace datasets in the CLIP latent space show a clear difference in the distribution of embeddings between the two datasets.
  • Figure 4: The set of images on the left shows images from DigiFace, the middle set shows images generated using identity-preserving sampling, and the right set shows images generated with CLIP shift added.
  • Figure 5: Intra-identity sampling is performed using spherical linear interpolation on the unit sphere, from the class prototype to the direction of the image samples in DigiFace.
  • ...and 2 more figures
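
The spherical linear interpolation (SLERP) mentioned in the Figure 5 caption can be illustrated with a minimal NumPy sketch. This is the generic textbook formula, not the authors' implementation: the names `slerp`, `prototype`, and `sample_direction`, the toy 3-D vectors, and the interpolation weights are all assumptions made for illustration, and the embedding space and weight range used in the paper are not specified in this summary.

```python
import numpy as np

def slerp(p, q, t, eps=1e-7):
    """Spherical linear interpolation between p and q on the unit sphere.

    Both inputs are unit-normalized first. t=0 returns the direction of p,
    t=1 returns the direction of q. Antipodal inputs are not handled.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / np.linalg.norm(p)
    q = q / np.linalg.norm(q)
    # Angle between the two unit vectors; clip guards against rounding error.
    omega = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    if omega < eps:
        # Nearly parallel vectors: linear interpolation is numerically safer.
        out = (1.0 - t) * p + t * q
        return out / np.linalg.norm(out)
    sin_omega = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / sin_omega) * p + (np.sin(t * omega) / sin_omega) * q

# Hypothetical toy vectors standing in for a class prototype and the
# direction of one image sample of the same identity.
prototype = np.array([1.0, 0.0, 0.0])
sample_direction = np.array([0.0, 1.0, 0.0])

# Small weights stay close to the prototype; larger weights move toward the sample.
variations = [slerp(prototype, sample_direction, t) for t in (0.1, 0.3, 0.5)]
```

Because the interpolated points stay on the unit sphere, every variation keeps unit norm, which is why SLERP is a common choice over plain linear interpolation for normalized embeddings.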